MPI_Type, MPI_Alltoallw, mpp_global_field update #5
This is the work of @marshallward
This patch contains three new features for FMS: Support for MPI datatypes, an
MPI_Alltoallw interface, and modifications to mpp_global_field to use these
changes for select operations.
These changes were primarily made to improve stability of large (>4000
rank) MPI jobs under OpenMPI at NCI.
There are differences in the performance of mpp_global_field,
occasionally even very large differences, but there is no consistency
across various MPI libraries. One method will be faster in one library,
and slower in another, even across MPI versions. Generally, the
MPI_Alltoallw method showed improved performance on our system, but this
is not a universal result. We therefore introduce a flag to control
this feature.
The inclusion of MPI_Type support may also be seen as an opportunity to
introduce other new MPI features for other operations, e.g. halo
exchange.
Detailed changes are summarised below.
MPI data transfer type ("MPI_Type") support has been added to FMS. This is
done with the following features:
* A `mpp_type` derived type has been added, which manages the type details
  and hides the MPI internals from the model developer. Types are managed
  inside of an internal linked list, `datatypes` (see the sketch after this
  list).

  Note: The name `mpp_type` is very similar to the preprocessor variable
  `MPP_TYPE_` and should possibly be renamed to something else, e.g.
  `mpp_datatype`.

* `mpp_type_create` and `mpp_type_free` are used to create and release these
  types within the MPI library. These append and remove mpp_types from the
  internal linked list, and include reference counters to manage duplicates.

* A `mpp_byte` type is created as a module-level variable for default
  operations.

  NOTE: As the first element of the list, it also inadvertently provides
  access to the rest of `datatypes`, which is private, but there are
  probably ways to address this.
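A minimal sketch of this kind of bookkeeping is shown below. The module
layout and field names (`id`, `counter`, `prev`, `next`) are illustrative
assumptions, not the actual FMS declarations.

```fortran
! Sketch only: illustrative names, not the FMS source.
module mpp_type_sketch
  use mpi
  implicit none
  private

  type, public :: mpp_type
    integer :: id = MPI_DATATYPE_NULL   ! handle returned by the MPI library
    integer :: counter = 0              ! reference count for duplicate types
    type(mpp_type), pointer :: prev => null()
    type(mpp_type), pointer :: next => null()
  end type mpp_type

  ! Default byte-sized type and head of the private `datatypes` list; it is
  ! set up to wrap MPI_BYTE at startup (not shown), and every other
  ! registered type is reachable from it via `next`.
  type(mpp_type), save, target, public :: mpp_byte

end module mpp_type_sketch
```

In such a scheme, a `mpp_type_create`-style routine would search the list for
an equivalent entry before committing a new MPI type, bumping `counter` on a
match, while `mpp_type_free` would decrement the count and only call
`MPI_Type_free` once it reaches zero.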
An MPI_Alltoallw wrapper, using MPI_Types, has been added to the mpp_alltoall
interface.
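For orientation, the sketch below shows the bare `MPI_Alltoallw` call that
such a wrapper ultimately issues; the actual mpp_alltoall argument list and
internal bookkeeping are not shown, and the program is only an illustration
of the MPI semantics (byte displacements, one datatype per rank pair).

```fortran
! Each rank sends one default real to every rank.  MPI_Alltoallw takes its
! displacements in bytes and accepts a separate MPI datatype per rank pair,
! which is what makes the mpp_type machinery above useful here.
program alltoallw_sketch
  use mpi
  implicit none

  integer :: ierr, npes, pe, i, nbytes
  integer, allocatable :: sendcounts(:), recvcounts(:)
  integer, allocatable :: sdispls(:), rdispls(:)
  integer, allocatable :: sendtypes(:), recvtypes(:)
  real, allocatable :: sendbuf(:), recvbuf(:)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, npes, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, pe, ierr)

  allocate(sendcounts(npes), recvcounts(npes), sdispls(npes), rdispls(npes))
  allocate(sendtypes(npes), recvtypes(npes), sendbuf(npes), recvbuf(npes))

  sendbuf(:) = real(pe)
  sendcounts(:) = 1
  recvcounts(:) = 1
  sendtypes(:) = MPI_REAL
  recvtypes(:) = MPI_REAL

  nbytes = storage_size(1.0) / 8    ! bytes per default real
  do i = 1, npes
    sdispls(i) = (i - 1) * nbytes   ! byte offset of the i-th element
    rdispls(i) = (i - 1) * nbytes
  end do

  call MPI_Alltoallw(sendbuf, sendcounts, sdispls, sendtypes, &
                     recvbuf, recvcounts, rdispls, recvtypes, &
                     MPI_COMM_WORLD, ierr)

  call MPI_Finalize(ierr)
end program alltoallw_sketch
```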
An implementation of mpp_global_field using MPI_Alltoallw and mpp_types has
been added. In addition to replacing the point-to-point operations with a
collective, it also eliminates the need to use the internal MPP stack.
Since MPI_Alltoallw requires that the input field be contiguous, it is only
enabled for data domains (i.e. compute + halo). This limitation can be
overcome, either by copying or by more careful attention to layout, but it
can be addressed in a future patch.
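As a rough illustration of why the datatypes help here, the sketch below
builds a receive-side subarray type that deposits one sender's block directly
into the global array, which is what removes the need for an intermediate
stack buffer. The routine name and index arguments are assumptions for
illustration, not the FMS implementation.

```fortran
! Describe where one sender's block (is:ie, js:je) sits inside the global
! array (isg:ieg, jsg:jeg), so MPI_Alltoallw can write it there in place.
subroutine make_block_recvtype(isg, ieg, jsg, jeg, is, ie, js, je, recvtype)
  use mpi
  implicit none
  integer, intent(in)  :: isg, ieg, jsg, jeg   ! global index bounds
  integer, intent(in)  :: is, ie, js, je       ! bounds of the sender's block
  integer, intent(out) :: recvtype             ! resulting MPI datatype handle

  integer :: sizes(2), subsizes(2), starts(2), ierr

  sizes    = [ieg - isg + 1, jeg - jsg + 1]    ! shape of the global array
  subsizes = [ie - is + 1, je - js + 1]        ! shape of the incoming block
  starts   = [is - isg, js - jsg]              ! zero-based offsets of the block

  call MPI_Type_create_subarray(2, sizes, subsizes, starts, &
                                MPI_ORDER_FORTRAN, MPI_REAL, recvtype, ierr)
  call MPI_Type_commit(recvtype, ierr)
  ! recvtype would then appear as one entry of recvtypes(:) in MPI_Alltoallw,
  ! with a recvcount of 1 and a byte displacement of 0.
end subroutine make_block_recvtype
```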
This method is enabled in the `mpp_domains_nml` namelist group, by setting
the `use_alltoallw` flag to `.true.`.
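For example, a minimal namelist entry enabling it would look like the
following (any other `mpp_domains_nml` settings are omitted here):

```fortran
&mpp_domains_nml
    use_alltoallw = .true.
/
```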
Provisional interfaces to SHMEM and serial ("nocomm") builds have been added,
although they are as yet untested and primarily meant as placeholders for now.
This patch also includes the following changes to support this work.
In `get_peset`, the method used to generate MPI subcommunicators has been
changed; specifically, `MPI_Comm_create` has been replaced with
`MPI_Comm_create_group`. The former is blocking over all ranks, while the
latter is only blocking over ranks in the subgroup.
This was done to accommodate IO domains of a single rank, usually due to
masking, which would result in no communication and cause a model hang.
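The difference is easiest to see in terms of the bare MPI calls, sketched
below; `make_subcomm` and `pelist` are illustrative names, and get_peset's
actual bookkeeping differs.

```fortran
! MPI_Comm_create is collective over the parent communicator, whereas
! MPI_Comm_create_group (MPI 3.0) is collective only over the ranks in
! sub_group, so a single-rank (e.g. masked IO) subset cannot stall others.
subroutine make_subcomm(npes, pelist, subcomm)
  use mpi
  implicit none
  integer, intent(in)  :: npes, pelist(npes)   ! world ranks in the subgroup
  integer, intent(out) :: subcomm

  integer :: world_group, sub_group, ierr
  integer, parameter :: tag = 0

  call MPI_Comm_group(MPI_COMM_WORLD, world_group, ierr)
  call MPI_Group_incl(world_group, npes, pelist, sub_group, ierr)

  ! Only the members of sub_group need to make this call:
  call MPI_Comm_create_group(MPI_COMM_WORLD, sub_group, tag, subcomm, ierr)

  call MPI_Group_free(sub_group, ierr)
  call MPI_Group_free(world_group, ierr)
end subroutine make_subcomm
```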
It seems that more recent changes in FMS related to handling single-rank
communicators were made to avoid this particular scenario from happening, but
I still think that it's more correct to use `MPI_Comm_create_group` and have
left the change.
This is an MPI 3.0 feature, so this might be an issue for older MPI
libraries.
Logical interfaces have been added to mpp_alltoall and mpp_alltoallv.
Single-rank PE checks in mpp_alltoall were removed to prevent model hangs
with the subcommunicators.
NULL_PE checks have been added to the original point-to-point implementation
of mpp_global_field, although these may not be required anymore due to
changes in the subcommunicator implementation.
This work was by Nic Hannah, and may actually be part of an existing pull
request. (TODO: Check this!)
Timer events have been added to mpp_type_create and mpp_type_free, although
they are not yet initialized anywhere.
The diagnostic field count was increased from 150 to 250, to support the
current needs of researchers.