v2606: New and improved parallel operation

GAMG processor agglomeration: communication-aware multiple masters

In this release, procFaces has been extended to accept an optional nMasters (or nProcessorsPerMaster) setting. This enables similar behaviour to masterCoarsest by agglomerating processors at the coarsest level until only nMasters or fewer remain. With nMasters 1, the behaviour is identical to masterCoarsest since no inter-processor boundaries will remain.

Effect

Running the simpleFoam pitzDaily tutorial with the following configuration:

p
{
    solver                  GAMG;
    ..
    processorAgglomerator   procFaces;
    nMasters                4;
}

With the debug switch enabled:

DebugSwitches
{
    GAMGAgglomeration 1;
}

The agglomeration statistics reported for this case are:

                              nCells       nFaces/nCells         nInterfaces    nIntFaces/nCells     profile
   Level  nProcs         avg     max         avg     max         avg     max         avg     max         avg
   -----  ------         ---     ---         ---     ---         ---     ---         ---     ---         ---
       0      16         764     770       1.926   1.927         3.5       5      0.1025  0.1299   2.044e+04
       1      16         381     385       1.964   2.143         3.5       5      0.1686  0.2158        8367
       2      16         189     192       2.292   2.774         3.5       5      0.2689   0.349        3062
       3      16          93      96       2.321   2.645         3.5       5      0.4439  0.5532        1028
       4      16          45      47        2.33   2.489         3.5       5      0.6251  0.8043         362
       5       4          89      90       2.497   2.596         1.5       2      0.2982  0.4432       902.5
       6       4          41      43       2.298   2.429         1.5       2      0.4132    0.65       272.2

Master-coarsest with compact masters

In this release, masters can be forced to be allocated compactly using the new compactMasters option:

p
{
    solver                  GAMG;
    ..
    processorAgglomerator   masterCoarsest;
    nMasters                4;
    compactMasters          true;
}

In a case with 16 processors and 4 masters, the default master allocation places masters on processors 0, 4, 8, and 12 (visible with the masterCoarsest debug switch):

  master  nProcs  procIDs
  0       4       (1 2 3)
  4       4       (5 6 7)
  8       4       (9 10 11)
  12      4       (13 14 15)

With compactMasters enabled, masters are assigned to the lowest-numbered processors:

  master  nProcs  procIDs
  0       4       (4 5 6)
  1       4       (7 8 9)
  2       4       (10 11 12)
  3       4       (13 14 15)
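
The two layouts listed above can be reproduced with a short sketch. The names MasterGroup and allocateMasters are hypothetical and not part of OpenFOAM. The sketch assumes that the number of processors is an exact multiple of nMasters, and it only mirrors the pattern visible in the tables rather than the actual masterCoarsest implementation.

```cpp
#include <cassert>
#include <vector>

// One master processor and the other processors agglomerated onto it
// (hypothetical type, for illustration only)
struct MasterGroup
{
    int master;
    std::vector<int> procIDs;
};

// Illustrative only: mirrors the two layouts shown in the tables above.
// Assumes nProcs is an exact multiple of nMasters.
std::vector<MasterGroup> allocateMasters(int nProcs, int nMasters, bool compact)
{
    assert(nMasters > 0 && nProcs % nMasters == 0);

    // Group size counts the master itself plus its other processors
    const int groupSize = nProcs / nMasters;

    std::vector<MasterGroup> groups;
    for (int i = 0; i < nMasters; ++i)
    {
        MasterGroup g;
        int first = 0;

        if (compact)
        {
            // Masters occupy ranks 0..nMasters-1; the remaining ranks
            // are assigned to them in contiguous blocks
            g.master = i;
            first = nMasters + i*(groupSize - 1);
        }
        else
        {
            // Each master is the first rank of its own contiguous block
            g.master = i*groupSize;
            first = g.master + 1;
        }

        for (int p = first; p < first + groupSize - 1; ++p)
        {
            g.procIDs.push_back(p);
        }
        groups.push_back(g);
    }
    return groups;
}
```

Calling allocateMasters(16, 4, false) gives masters 0, 4, 8 and 12, while allocateMasters(16, 4, true) gives masters 0 to 3 with the remaining ranks attached in blocks of three.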

Effect

Without compact masters:

                              nCells       nFaces/nCells         nInterfaces    nIntFaces/nCells     profile
   Level  nProcs         avg     max         avg     max         avg     max         avg     max         avg
   -----  ------         ---     ---         ---     ---         ---     ---         ---     ---         ---
       0      16         764     770       1.926   1.927         3.5       5      0.1025  0.1299   2.044e+04
       1      16         381     385       1.964   2.143         3.5       5      0.1686  0.2158        8367
       2      16         189     192       2.292   2.774         3.5       5      0.2689   0.349        3062
       3      16          93      96       2.321   2.645         3.5       5      0.4439  0.5532        1028
       4      16          45      47        2.33   2.489         3.5       5      0.6251  0.8043         362
       5       4          89      90       2.536   2.596         1.5       2      0.2191  0.2889       949.5
       6       4          41      43        2.36   2.429         1.5       2      0.2897     0.4       291.2

With compact masters:

                              nCells       nFaces/nCells         nInterfaces    nIntFaces/nCells     profile
   Level  nProcs         avg     max         avg     max         avg     max         avg     max         avg
   -----  ------         ---     ---         ---     ---         ---     ---         ---     ---         ---
       0      16         764     770       1.926   1.927         3.5       5      0.1025  0.1299   2.044e+04
       1      16         381     385       1.964   2.143         3.5       5      0.1686  0.2158        8367
       2      16         189     192       2.292   2.774         3.5       5      0.2689   0.349        3062
       3      16          93      96       2.321   2.645         3.5       5      0.4439  0.5532        1028
       4      16          45      47        2.33   2.489         3.5       5      0.6251  0.8043         362
       5       4          89      90       2.319   2.433           3       3       0.653  0.8621       599.8
       6       4          41      43       2.029   2.116           3       3      0.9538   1.268       172.5

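The compact masters configuration results in more inter-processor boundaries (an average of 3 interfaces per processor at levels 5 and 6, compared with 1.5) and a higher ratio of interface faces to cells, while the profile is lower (599.8 compared with 949.5 at level 5). For larger agglomerations, e.g. 64 cores, this effect diminishes. However, there is no guarantee that equal-sized clusters are produced; on large decompositions, clusters of unequal size may still create bottlenecks.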

Source code

Tutorial

Merge request

Parallel

The Pstream interface to MPI includes several improvements in this release:

  • probeMessage(): simplified parameters for a regular MPI_Probe.
  • probeMessages(): probe and return sizes from multiple sources simultaneously.

The IPstream (input/receive stream) can now be called with a “receive from any” mode, and its receive buffer can also be released. This enables algorithms to use IPstream to probe and receive, then recover the buffer for forwarding or storage with delayed deserialisation.

Additional align, tell, seek, and other methods on the input and output Pstreams allow them to be used for composite output with rewriting, enabling more flexible aggregated data handling.

Reduced MPI overhead in field function objects

The function objects fieldExtent, fieldMinMax, and fieldStatistics have been updated to significantly reduce the overall number of MPI operations and lower memory overhead. Although these function objects are not in critical code paths, they were selected as initial candidates for assessing the types of hidden overheads that remain in the code. The key improvements are:

  • the intermediate volume field required for the mag() operation is avoided in most cases
  • bounding box reductions are now bundled together, resulting in exactly two MPI reductions instead of 2*(nPatches+1)
  • fieldMinMax and fieldStatistics now use a single MPI_Allgather instead of six separate ones
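
The saving from bundling the bounding box reductions can be illustrated with a small stand-alone sketch. It does not use MPI or OpenFOAM: MockComm is a hypothetical stand-in for an all-reduce that combines one buffer per rank and counts its calls, and Box, pack, reducePerBox and reduceBundled are likewise illustrative names. The sketch assumes every rank holds the same number of boxes (one per patch plus one for the internal field).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Point = std::array<double, 3>;
struct Box { Point min; Point max; };
using Buffers = std::vector<std::vector<double>>;  // one buffer per rank

// Hypothetical stand-in for an MPI all-reduce: combines equally sized
// per-rank buffers element-wise and counts how often it is called
struct MockComm
{
    int nReductions = 0;

    std::vector<double> allReduce(const Buffers& perRank, bool useMin)
    {
        assert(!perRank.empty());
        ++nReductions;
        std::vector<double> result(perRank[0]);
        for (std::size_t r = 1; r < perRank.size(); ++r)
        {
            assert(perRank[r].size() == result.size());
            for (std::size_t i = 0; i < result.size(); ++i)
            {
                result[i] = useMin
                  ? std::min(result[i], perRank[r][i])
                  : std::max(result[i], perRank[r][i]);
            }
        }
        return result;
    }
};

// Flatten the min (or max) corners of a list of boxes into one buffer
std::vector<double> pack(const std::vector<Box>& boxes, bool useMin)
{
    std::vector<double> buf;
    for (const Box& b : boxes)
    {
        const Point& p = useMin ? b.min : b.max;
        buf.insert(buf.end(), p.begin(), p.end());
    }
    return buf;
}

// Unbundled: one min and one max reduction for every box
std::vector<Box> reducePerBox
(
    MockComm& comm,
    const std::vector<std::vector<Box>>& boxesPerRank
)
{
    std::vector<Box> out(boxesPerRank.at(0).size());
    for (std::size_t b = 0; b < out.size(); ++b)
    {
        Buffers mins, maxs;
        for (const auto& boxes : boxesPerRank)
        {
            mins.push_back(pack(std::vector<Box>(1, boxes[b]), true));
            maxs.push_back(pack(std::vector<Box>(1, boxes[b]), false));
        }
        const auto lo = comm.allReduce(mins, true);
        const auto hi = comm.allReduce(maxs, false);
        std::copy(lo.begin(), lo.end(), out[b].min.begin());
        std::copy(hi.begin(), hi.end(), out[b].max.begin());
    }
    return out;
}

// Bundled: all min corners in one reduction, all max corners in another
std::vector<Box> reduceBundled
(
    MockComm& comm,
    const std::vector<std::vector<Box>>& boxesPerRank
)
{
    Buffers mins, maxs;
    for (const auto& boxes : boxesPerRank)
    {
        mins.push_back(pack(boxes, true));
        maxs.push_back(pack(boxes, false));
    }
    const auto lo = comm.allReduce(mins, true);
    const auto hi = comm.allReduce(maxs, false);

    std::vector<Box> out(lo.size()/3);
    for (std::size_t b = 0; b < out.size(); ++b)
    {
        std::copy_n(lo.begin() + 3*b, 3, out[b].min.begin());
        std::copy_n(hi.begin() + 3*b, 3, out[b].max.begin());
    }
    return out;
}
```

With two patches plus the internal field, reducePerBox issues 2*(2+1) = 6 reductions, whereas reduceBundled issues two and returns the same boxes. In a real MPI setting the two bundled calls would correspond to element-wise minimum and maximum reductions over the packed buffers.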

Further work in this area is expected as these types of hidden overheads become more noticeable with increasing core counts.

Specialised reduction

A specialised Foam::reduce for MinMax with sumOp has been added. This passes directly to the corresponding MPI reductions without any intermediate tree communication or serialisation/deserialisation overhead.