v2312: New and improved parallel operation
Coupled constraint boundary conditions, e.g. processor and cyclic, should be consistent with their internal field values. For example, on a processor patch the value field holds the neighbouring processor's cell values, i.e. it caches the cell values from the other side. Other coupled patch fields might instead store the interpolation of the local and neighbouring cell values. The rule of thumb is that any code modifying cell values, e.g. a gradient calculation, should make a call to correct the boundary conditions to perform the value update (which equates to a halo-swap on processor boundary fields). However, for 'local' operations, e.g. multiplication, this can sometimes be skipped if the boundary condition only has a value and does not depend on cell values. Most 'normal' boundary conditions fall into this category.
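As a minimal sketch of this rule of thumb (illustrative only; the field name T and the doubling operation are arbitrary choices, not code from this release), a small utility that modifies cell values would follow the local operation with an explicit boundary update:

#include "fvCFD.H"

int main(int argc, char *argv[])
{
    #include "setRootCase.H"
    #include "createTime.H"
    #include "createMesh.H"

    // Read an existing volScalarField (assumed here to be called "T")
    volScalarField T
    (
        IOobject
        (
            "T",
            runTime.timeName(),
            mesh,
            IOobject::MUST_READ
        ),
        mesh
    );

    // 'Local' operation on the internal (cell) values only
    T.primitiveFieldRef() *= 2.0;

    // Update the coupled patch values so that e.g. processor patches
    // refresh their cached neighbour values (equates to a halo-swap)
    T.correctBoundaryConditions();

    return 0;
}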
For coupled boundary conditions, skipping the evaluation after 'local' operations also applies to processor patch fields, but not to e.g. the cyclic and cyclicAMI variants. This results in the values of intermediate calculations not obeying the consistency constraint.
In this release we enforce an evaluation after local operations such that at any time the value of a constraint patch is up to date. Please note that this choice is still under investigation and likely to be updated in the future.
Coupled patchField consistency checking
Visual code checks identified that the wall-distance fields returned by the meshWave and meshWaveAddressing methods were not parallel consistent. To help catch these types of errors an optional consistency check has been added, enabled using debug switches:
DebugSwitches
{
    volScalarField::Boundary            1;
    volVectorField::Boundary            1;
    volSphericalTensorField::Boundary   1;
    volSymmTensorField::Boundary        1;
    volTensorField::Boundary            1;

    areaScalarField::Boundary           1;
    areaVectorField::Boundary           1;
    areaSphericalTensorField::Boundary  1;
    areaSymmTensorField::Boundary       1;
    areaTensorField::Boundary           1;
}
A value of 1 enables the check and leads to a FatalError when the check fails. This can be used to easily pinpoint problems, particularly in combination with FOAM_ABORT to produce a stack trace:
FOAM_ABORT=true mpirun -np 2 simpleFoam -parallel
The debug switch is interpreted as a bitmap:
bit | value (2^bit) | effect
--- | --- | ---
0 | 1 | add a check for every local operation
1 | 2 | print entry and exit of the check
2 | 4 | issue a warning instead of a FatalError
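For example, the bits can be combined: a value of 5 (= 1 + 4) performs the per-operation check but issues a warning instead of a FatalError:

DebugSwitches
{
    // 1 (check every local operation) + 4 (warn instead of FatalError)
    volScalarField::Boundary 5;
}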
The comparison tolerance is set to 0 by default. For processor halo-swaps there is no tolerance issue since exactly the same operations are performed in exactly the same order. However, for 'interpolating' coupled boundary conditions, e.g. cyclic and cyclicAMI, slightly different truncation errors arise since the local operation, e.g. multiplication by a constant, is performed after the interpolation, rather than interpolating the result of the local operation. In this case the tolerance can be overridden:
OptimisationSwitches
{
    volScalarField::Boundary::tolerance 1e-10;

    // .. and similar for all the other field types ..
}
Backwards compatibility
The new consistency operations will slightly change the behaviour of any case that uses a cyclic or cyclicAMI boundary condition, or any non-trivial turbulence model using coupled boundary conditions.
Optionally, this behaviour can be reverted to the previous (inconsistent!) form by overriding the localConsistency setting in etc/controlDict:
OptimisationSwitches
{
    //- Enable enforced consistency of constraint bcs after 'local' operations.
    //  Default is on. Set to 0/false to revert to <v2306 behaviour
    localConsistency 0;
}
A simple test is any tutorial with a cyclicAMI patch, e.g. the pipeCyclic tutorial. With the above DebugSwitches enabled to activate the checking the case runs OK, but when the localConsistency flag is disabled an inconsistency is detected:
[0] --> FOAM FATAL ERROR: (openfoam-2302 patch=230110)
[0] Field dev(symm(grad(U))) is not evaluated? On patch side1 type cyclicAMI : average of field = ...
Related issues
Source code
Merge request
- Merge request 628
The cyclicAMI boundary condition implements an area-weighted interpolation from multiple neighbouring faces. These faces can be local, or reside on remote processors and therefore require parallel communication.
In previous releases each cyclicAMI evaluation or matrix contribution in the linear solver (in the case of non-local neighbouring faces) triggered its own set of communications and waited for these to finish before continuing to the next cyclicAMI or processor patch. In this release the procedure follows the same pattern as the processor patches: a start-up phase that launches all sends/receives, followed by a 'consumption' phase that uses the remote data to update the local values. A typical boundary condition evaluation or linear solver update now takes the form:
- do all initEvaluate/initInterfaceMatrixUpdate (coupled boundaries only). For processor and cyclicA(C)MI patches this starts the non-blocking sends/receives.
- wait for all communication to finish (or combine with the step below using polling; see the v2306 nPollProcInterfaces option).
- do all evaluate/updateInterfaceMatrix. This uses the received data to calculate the contribution to the matrix solution.
By handling the communication from cyclicA(C)MI in exactly the same way as processor boundary conditions there is less chance of bottlenecks and hopefully better scaling. An additional optimisation is that the local send/receive buffers are allocated once and reused.
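Schematically (a sketch only, assuming an existing volScalarField T; the real implementation sits inside the boundary-field evaluation and matrix-update machinery), the two-phase update looks like:

// Phase 1: start non-blocking sends/receives on all coupled patches
auto& bf = T.boundaryFieldRef();

forAll(bf, patchi)
{
    if (bf[patchi].coupled())
    {
        bf[patchi].initEvaluate(Pstream::commsTypes::nonBlocking);
    }
}

// Wait for all outstanding communication to complete
Pstream::waitRequests();

// Phase 2: consume the received data to update the local patch values
forAll(bf, patchi)
{
    if (bf[patchi].coupled())
    {
        bf[patchi].evaluate(Pstream::commsTypes::nonBlocking);
    }
}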
Source code
Merge request
- Merge request 641
Tutorial
- any case with cyclicAMI or cyclicACMI
The GAMG solver, in addition to local agglomeration, can combine matrices across processors (processor agglomeration). This can be beneficial at larger core counts since it:
- lowers the number of cores solving the coarsest level - most of the global reductions happen at the coarsest level; and
- increases the amount of implicitness for all operations, i.e. smoothing, preconditioning.
In this release the framework has been extended to allow processor agglomeration of all coupled boundary conditions, e.g. cyclicAMI and cyclicACMI.
As a test, a comparison was made between
- a single 40x10x1 block; and
- two 20x10x1 blocks coupled using cyclicAMI.
Both cases were decomposed into 4 subdomains, using the GAMG solver in combination with the masterCoarsest processor agglomerator, where all matrices are combined onto the master(s):
solvers
{
    p
    {
        solver                  GAMG;
        processorAgglomerator   masterCoarsest;
        ..
    }
}
- single block (so no cyclicAMI, only processor faces):
                nCells        nInterfaces
Level  nProcs   avg    max    avg    max
-----  ------   ---    ---    ---    ---
0      4        100    100    1.5    2
1      4        50     50     1.5    2
2      1        100    100    0      0
3      1        48     48     0      0
The number of boundaries (nInterfaces) becomes 0 as all processor faces become internal.
- two-block case (so cyclicAMI and processor faces):
                nCells        nInterfaces
Level  nProcs   avg    max    avg    max
-----  ------   ---    ---    ---    ---
0      4        100    100    3      3
1      4        50     50     3      3
2      1        100    100    2      2
3      1        48     48     2      2
Here, the number of boundaries reduces from 3 to 2 since only the two cyclicAMI are preserved.
Notes
- cyclicA(C)MI:
  - as all faces become local, the behaviour is reset to non-distributed, i.e. operations are applied directly on the provided fields without any additional copying.
  - rotational transformations are not yet supported. This is not a fundamental limitation but requires additional rewriting of the stencils to take transformations into account.
- processorCyclic (a cyclic with owner and neighbour cells on different processors) is not yet supported. It is treated as a normal processor boundary and will therefore lose any transformation. Note that processorCyclic can be avoided by using the patches constraint in decomposeParDict, e.g.
constraints
{
    patches
    {
        //- Keep owner and neighbour on same processor for faces in patches
        //  (only makes sense for cyclic patches and cyclicAMI)
        type        preservePatches;
        patches     (cyclic);
    }
}
- only masterCoarsest has been tested, but the code should support any other processor-agglomeration method.
- the limited testing to date has shown no benefit from processor agglomeration of cyclicAMI. It is only useful if bottlenecks, e.g. the number of global reductions or the degree of implicitness, are the issue.
Source code
Merge request
- Merge request 645
New hostUncollated fileHandler. This uses the first core on each node to perform the I/O. It is equivalent to explicitly specifying cores using the ioRanks option.
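For example (the solver name and core count are placeholders), the handler can be selected on the command line:

mpirun -np 8 simpleFoam -parallel -fileHandler hostUncollated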
Improved general support of the collated file format and corresponding adjustments to the redistributePar utility. With these changes, the collated format can be used for a wider range of workflows than previously possible.
Handling of dynamic code, e.g. codedFixedValue boundary condition, is now supported for distributed file systems. For these systems, the dynamically compiled libraries are automatically distributed to the other nodes.
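For reference, a typical codedFixedValue entry looks like the following (a generic sketch; the patch name, generated-code name and ramp expression are arbitrary):

inlet
{
    type            codedFixedValue;
    value           uniform 0;
    name            rampedInlet;    // name used for the generated code

    code
    #{
        // illustrative: ramp the fixed value with time
        operator==(min(10.0, 0.1*this->db().time().value()));
    #};
}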
The numberOfSubdomains entry in the decomposeParDict file is now optional. If not specified, it is set to the number of processors the job was started with. Note that this is not useful for some methods, e.g. hierarchical, which requires a consistent number of subdivisions in the three coordinate directions.
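A minimal decomposeParDict (sketch) that relies on this default could therefore be reduced to:

// numberOfSubdomains omitted: defaults to the number of ranks
// the job was started with

method          scotch;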
Distributed roots are automatically copied from the master processor. With a hostfile containing multiple hosts it is now possible to automatically construct e.g. the processors5_XXX-YYY directories on the local or remote nodes:
mpirun -hostfile hostfile -np 5 ${FOAM_ETC}/openfoam redistributePar -parallel -fileHandler hostCollated -decompose
Note the use of the openfoam wrapper script to ensure that all nodes use the same OpenFOAM installation.
In previous versions, using "include" files in combination with collated could be very fragile when the file contents were treated as runtime-modifiable. The handling of watched files has now been updated to ensure proper correspondence across the processor ranks.
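A typical case that exercises this is a runtime-modifiable dictionary pulling in an include file (the file name here is illustrative):

// system/controlDict (fragment)
#include        "flowSettings"

runTimeModifiable   yes;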