Solver
The solver runs the simulation by advancing the solution in time.
Thanks to the preparatory work done by preproc, solver has minimal duties besides the raw time integration. Before starting the time integration, solver only needs to:
- read the cube-specific files, and assign cube data to each MPI rank;
- compute the pointers which will be used in the time advancement;
- define the time advancement routines;
- complete the setup of the MPI communications;
- load the restart field, if applicable.
After these preliminary steps are completed, the execution proper starts.
solver has a minimal default screen output, which is meant to be customized through the case-specific file (like specifics-channel.cpl and specifics-nose.cpl for the tutorial examples). The output is also stored in <caseDir>/runtime.dat.
The main program
The file solver.cpl is the main program of the solver. Most of the operations which are independent of the specific flow type, namely input and output, BCs, MPI communications, the runtime computation of some flow statistics and the GPU initialization, have their dedicated files io.cpl, bc.cpl, comm.cpl and gpuInit.cpl in the folder modules-solver/.
Pointers
solver computes the solution only at grid points in the fluid (internal points).
For computational efficiency, arrays of pointers to the internal points are pre-computed. Multiple arrays are used:
- time advancement pointers: they store addresses of the collocation points where at least one of the velocity components is in the fluid region;
- pressure correction pointers: they store addresses of the collocation points where the pressure correction routine has to be performed; two arrays exist, one for the red points and one for the black points. The programming logic behind the identification of the pressure correction points is complex, as pressure correction is not applied to internal points belonging to non-periodic BCs;
- BC pointers: they store addresses of the collocation points where BCs are applied; they are obtained from the boolean maps computed by
preproc, and are also used to compute flow rates.
Pointers are highly advantageous from a computational viewpoint, but introduce additional complexity when working in heterogeneous memory environments (e.g. CPU-GPU). Pointers computed on the CPU cannot be used on the GPU, and vice versa (unless managed memory is used). On the other hand, computing pointers is a sequential operation, which becomes very slow on the GPU. Although computing pointers is a one-time operation, a hybrid approach is implemented:
- the time advancement pointers, which are simple to define, are computed sequentially on the GPU, asynchronously with respect to the CPU;
- the pressure correction and BC pointers, which are more complex to define, are computed on the CPU, where the interaction with complex data structures such as the boolean maps is easier. An arbitrary base pointer inside the data structure is then subtracted from each pointer, turning addresses into byte shifts, which are copied to the GPU. Once on the GPU, the byte shifts are added to another base pointer, restoring their pointer nature; this last operation is performed in parallel on the GPU;
- synchronization is imposed.
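As a rough illustration of this address-to-shift trick, the sketch below (plain CUDA, not the actual CPL source; names such as `hBase`, `dBase` and `sendPointersToGPU` are invented for the example) converts CPU-computed pointers into byte shifts relative to a base address, copies the shifts to the GPU, and re-bases them there in a parallel kernel.

```cuda
#include <stddef.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Parallel re-basing on the GPU: byte shift -> device pointer.
__global__ void rebaseOnGPU(const ptrdiff_t *shifts, double **dPtrs,
                            double *dBase, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dPtrs[i] = (double *)((char *)dBase + shifts[i]);
}

// hBase/hPtrs live on the CPU, dBase/dPtrs on the GPU (two copies of the
// same data structure in different memories).
void sendPointersToGPU(double *hBase, double **hPtrs, int n,
                       double *dBase, double **dPtrs)
{
  ptrdiff_t *hShifts = (ptrdiff_t *)malloc(n * sizeof(ptrdiff_t));
  ptrdiff_t *dShifts;
  cudaMalloc((void **)&dShifts, n * sizeof(ptrdiff_t));

  // CPU (sequential): address -> byte shift inside the data structure.
  for (int i = 0; i < n; i++)
    hShifts[i] = (char *)hPtrs[i] - (char *)hBase;

  // Copy the shifts, then restore their pointer nature in parallel on the GPU.
  cudaMemcpy(dShifts, hShifts, n * sizeof(ptrdiff_t), cudaMemcpyHostToDevice);
  rebaseOnGPU<<<(n + 255) / 256, 256>>>(dShifts, dPtrs, dBase, n);
  cudaDeviceSynchronize();   // synchronize before releasing the shifts

  cudaFree(dShifts);
  free(hShifts);
}
```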
Input-output operations
The file io.cpl contains the procedure which distributes cubes among ranks; the routines responsible for the read-write operations of the different data structures can be found in modules-common/routines.cpl. Both are also USEd by postproc.cpl.
Loading the cube files involves multiple open/close operations on each file; this reduces the file system load at the expense of a slight increase in execution time. Data consistency is checked at every step. The loading procedure is as follows:
- the shared datafile `general-info.dat` is read by all the MPI ranks (an error is triggered whenever the number of ranks exceeds `nTotCubes`);
- the array of cube IDs is subdivided randomly among the available ranks, such that each rank has (approximately) the same number of cubes;
- cube locality (i.e. which rank loads which cube) is established, and the information is shared among ranks, to allow writing the connectivity information needed for the MPI communications (an error is triggered whenever locality is not correct);
- the connectivity of each cube (i.e. for each cube, the rank of each neighboring cube) is established;
- the number of points contained in each cube is read (an error is triggered whenever cubes have a different number of points);
- the IB coefficients and boolean maps for non-periodic BCs for each cube are loaded.
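As a minimal sketch of the random subdivision step (plain C, not the actual io.cpl code; the routine name `assignCubes`, the fixed seed and the use of `rand` are assumptions made for the example), all ranks can shuffle the list of cube IDs with the same seed and then pick every `nRanks`-th entry, so that each rank ends up with (approximately) the same number of cubes and every rank knows the locality of every cube:

```c
#include <stdlib.h>

// Deterministic shuffle + round-robin deal: every rank runs the same code
// with the same seed, so cube locality is identical on all ranks.
void assignCubes(int *cubeIDs, int nTotCubes, int nRanks, int myRank,
                 int *myCubes, int *nMyCubes)
{
  srand(1234);                              // same seed on every rank
  for (int i = nTotCubes - 1; i > 0; i--) { // Fisher-Yates shuffle
    int j = rand() % (i + 1);
    int tmp = cubeIDs[i]; cubeIDs[i] = cubeIDs[j]; cubeIDs[j] = tmp;
  }
  *nMyCubes = 0;                            // deal the shuffled IDs round-robin
  for (int i = myRank; i < nTotCubes; i += nRanks)
    myCubes[(*nMyCubes)++] = cubeIDs[i];
}
```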
The random distribution of cubes among ranks already provides a relatively balanced load. For a proper (static) load balancing, the only step left is the solution of a graph partitioning problem, as done for example by Jansson et al (2019). Once the graph is partitioned, by assigning the cube IDs to each rank, the rest of the loading procedure remains unchanged, and no heavy data structures (such as the IB coefficients) need to be moved between ranks.
Communications
The program accepts domain decomposition in all three spatial directions; therefore, up to 26 communications may be performed by each cube (6 faces + 12 edges + 8 vertices). More than one cube can be assigned to each rank. Since multiple same-length buffers may happen to be exchanged by two ranks at the same time and in the same direction, buffers must be identified with MPI tags, built from the ID of the sending cube and the communication direction. A unique tag for each communication is defined as:
`tag = (sndCubeID - 1) x 26 + commNum - 1`
where sndCubeID is the ID of the sending cube, and commNum is an integer associated to a communication in a specific direction, from the sending cube (see table below):
| commNum (faces) | Direction |
|---|---|
| 1 | Xprev |
| 2 | Xnext |
| 3 | Yprev |
| 4 | Ynext |
| 5 | Zprev |
| 6 | Znext |
| commNum (edges) | Direction |
|---|---|
| 7 | XnextYnext |
| 8 | XprevYnext |
| 9 | XnextYprev |
| 10 | XprevYprev |
| 11 | YnextZnext |
| 12 | YprevZnext |
| 13 | YnextZprev |
| 14 | YprevZprev |
| 15 | XnextZnext |
| 16 | XprevZnext |
| 17 | XnextZprev |
| 18 | XprevZprev |
| commNum (vertices) | Direction |
|---|---|
| 19 | XnextYnextZprev |
| 20 | XprevYnextZprev |
| 21 | XnextYprevZprev |
| 22 | XprevYprevZprev |
| 23 | XnextYnextZnext |
| 24 | XprevYnextZnext |
| 25 | XnextYprevZnext |
| 26 | XprevYprevZnext |
Therefore, for the connectivity between ranks to be fully established, the following information must be known for each cube and communication direction:
- ID of the receiving cube to which the buffer is sent;
- MPI rank where the receiving cube is loaded;
- MPI tag for the send buffer;
- MPI tag for the receive buffer.
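The sketch below (C; the struct `CubeComm` and the helper `oppositeCommNum` are hypothetical, only the tag formula comes from the text above) shows how this information could be grouped per communication, with the receive tag built from the neighbor's cube ID and the neighbor's own direction index:

```c
// Hypothetical per-communication connectivity entry (in the code this
// information is stored in the 4 x 26 integer array cubeConnectivity).
typedef struct {
  int recvCubeID;  // ID of the receiving cube the buffer is sent to
  int recvRank;    // MPI rank which loaded that cube
  int sendTag;     // MPI tag of the buffer sent by this cube
  int recvTag;     // MPI tag of the buffer sent back by the neighbor
} CubeComm;

// Unique tag: built from the sending cube ID and the communication direction.
static int commTag(int sndCubeID, int commNum)
{
  return (sndCubeID - 1) * 26 + commNum - 1;
}

// For communication commNum of cube myCubeID towards cube nbCubeID:
//   c.sendTag = commTag(myCubeID, commNum);
//   c.recvTag = commTag(nbCubeID, oppositeCommNum(commNum)); // neighbor's view
```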
The MPI communications are currently implemented with a procedure that involves three copy steps:
- data for a given communication are copied to the corresponding send buffer;
- the `MPI_Isend`/`MPI_Irecv` call is posted, and the program then waits until all the ranks are synchronized before continuing;
- data are copied from the receive buffer to the main data structure.
Alternatively, a shortcut for communications between cubes belonging to the same rank is available. The shortcut not only saves one copy in the procedure, with better performance and less MPI overhead, but is also needed because of the dual CPU-GPU nature of the program: GPU-aware MPI allows direct GPU-GPU copies between addresses belonging to different devices, but not between addresses on the same device, so `MPI_Send`/`MPI_Recv` calls must be avoided in this case. Finally, it is important to recall that AMPHIBIOUS assumes the use of GPU-aware MPI to work with multiple GPUs.
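The overall structure of the exchange, including the in-rank shortcut, roughly looks like the sketch below (C with MPI, reusing the hypothetical `CubeComm` entry from the previous sketch; `packSendBuffer` and `copyBufferDirectly` are placeholders, not the actual comm.cpl routines):

```c
#include <mpi.h>

/* CubeComm as in the previous sketch; the two helpers are placeholders. */
void packSendBuffer(double *sendBuf, const CubeComm *c);
void copyBufferDirectly(const double *sendBuf, const CubeComm *c);

// One communication of a cube: either a direct copy (same rank) or the
// MPI path (pack -> Isend/Irecv -> wait -> unpack).
void exchange(const CubeComm *c, int myRank, double *sendBuf, double *recvBuf,
              int count, MPI_Request *reqs, int *nReqs)
{
  packSendBuffer(sendBuf, c);               // copy data into the send buffer
  if (c->recvRank == myRank) {
    copyBufferDirectly(sendBuf, c);         // shortcut: no MPI call at all
  } else {
    MPI_Isend(sendBuf, count, MPI_DOUBLE, c->recvRank, c->sendTag,
              MPI_COMM_WORLD, &reqs[(*nReqs)++]);
    MPI_Irecv(recvBuf, count, MPI_DOUBLE, c->recvRank, c->recvTag,
              MPI_COMM_WORLD, &reqs[(*nReqs)++]);
  }
}
// After posting all communications:
//   MPI_Waitall(nReqs, reqs, MPI_STATUSES_IGNORE);  // ranks synchronize here
// and each receive buffer is copied back into the main data structure.
```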
The currently implemented MPI communications are not optimized yet. Some future steps (in order of increasing complexity) towards improved efficiency are:
- for GPU, enable the asynchronous execution of kernels used to copy data from the data structure to the send/receive buffers;
- post the `MPI_Irecv` call before the routine after which the corresponding `MPI_Isend` call is posted, to overlap computations and communications;
- write a second communications routine to execute the red-black communications for the pressure iterative solver, which contains most of the communications performed in the timestep;
- skip the usage of the send buffer by using `MPI_Datatype` arrays (the procedure becomes a two-copy one for both in-rank and out-of-rank communications);
- pack the buffers of different cubes which are exchanged between the same two ranks, to avoid excessive communication startup overhead;
- use MPI topology, together with static load balancing, to optimize the distribution of cubes between ranks in the same node;
- use one-sided operations such as `MPI_Put`/`MPI_Get` and `MPI_Win_allocate` to skip the rank synchronization overhead (this solution has been implemented, but currently it does not work on GPU with MPI implementations which rely on UCX for network communications).
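For reference, the one-sided pattern mentioned in the last item can be sketched as follows (C with MPI; the routine and buffer names are placeholders, and a simple fence synchronization is used here for brevity):

```c
#include <mpi.h>

// One-sided exchange: each rank exposes its receive buffer in a window and
// neighbors write into it with MPI_Put, avoiding matched send/receive calls.
void oneSidedExchange(double *sendBuf, int count, int targetRank,
                      MPI_Aint targetDisp, MPI_Comm comm)
{
  double *recvBuf;              // window memory doubling as the receive buffer
  MPI_Win win;
  MPI_Win_allocate(count * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, comm, &recvBuf, &win);

  MPI_Win_fence(0, win);                              // open the exposure epoch
  MPI_Put(sendBuf, count, MPI_DOUBLE,                 // write into the target's
          targetRank, targetDisp, count, MPI_DOUBLE,  // window at targetDisp
          win);
  MPI_Win_fence(0, win);                              // complete all puts

  // ... use the data landed in recvBuf ...
  MPI_Win_free(&win);
}
```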
Boundary conditions
Implementing a new BC is a straightforward process. One should only pay attention to the dependence upon the (signed) direction of the normal of the boundary.
A new BC should be added within the CASE structure used for routines inletBC, outletBC or movingWallBC in the file modules-solver/bc.cpl.
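Conceptually, adding a new condition means adding one more branch that writes the ghost values as a function of the interior ones and of the signed boundary normal. The sketch below renders this as a C switch for illustration only; the actual bc.cpl uses the CPL CASE syntax, and all identifiers here are invented:

```c
// C-style analogue of the CASE dispatch in modules-solver/bc.cpl
// (illustrative only; BC identifiers and variable names are not the real ones).
enum { BC_INLET, BC_OUTLET, BC_MOVING_WALL, BC_MY_NEW_BC };

void applyBC(int bcID, double *ghost, const double *interior, int normal)
{
  switch (bcID) {
  case BC_INLET:       /* inletBC      */ break;
  case BC_OUTLET:      /* outletBC     */ break;
  case BC_MOVING_WALL: /* movingWallBC */ break;
  case BC_MY_NEW_BC:
    // hypothetical new BC: ghost value set from the interior one, with an
    // explicit dependence on the signed normal of the boundary
    *ghost = (normal > 0 ? 1.0 : -1.0) * (*interior);
    break;
  }
}
```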
The list of boundary conditions can be easily extended in the case of 1-point stencils. Less trivial is the extension to larger stencils, as the domain decomposition exposes the risk of a segmentation fault if the BC surface falls in an unlucky spot. In this case, the only option is to change the domain decomposition. At the moment no check is implemented, so beware of segmentation faults.
Except for the no-slip (w) condition, enforced through the IB coefficients, all the BCs must be imposed on planar surfaces. The BC planes are aligned with pressure points in the staggered grid. When the coordinate of a BC plane falls in-between two pressure grid points, it is applied to the nearest one.
The coordinates of the points where a BC is to be applied are saved in arrays of CPL pointers into the main data structure. These pointers are centered on the indices of the data structure which store the pressure collocation points on the surface where the BC is applied.
The figure below shows that, owing to the staggered grid, the implementation of a BC changes when it is applied to different velocity components and to surfaces with different normal direction and sign.
Figure: example of explicit BCs applied at two boundary planes with opposite normals. Green arrows are velocity points where equations are advanced in time, blue dots are pressure points. Red arrows are ghost velocity points where the BC is imposed. Black arrows and dots are points not involved in the computations. Orange arrows indicate the (signed) boundary normal. Dotted squares are grid points stored in the pointer arrays (cyan for a positive normal, magenta for a negative one). Colored numbers represent the relative index of the grid points with respect to the plane where the corresponding BC is applied.
Data structures and memory requirements
solver manages multiple cubes, therefore each cube needs its own private data. These data are stored in arrays with nCubes elements of the data structure needed by each cube.
The main data structures are listed and briefly commented below. The outer dimension nCubes of the arrays is omitted. Dimensions are reported from the outermost to the innermost.
Cube metadata
| Variable | Size | Description |
|---|---|---|
| rankCubesIdx | sizeof(int) | Stores the IDs of the cubes loaded by the MPI rank. |
| cubeInfo | 14 x sizeof(int) + 6 x sizeof(double) | Stores general data of the cube: cube ID • number of fluid points • grid spacing (for uniform grids) • position of the first ghost point p(0,0,0) • number of points in the cube • global index of the first point (1,1,1) • indices of the last fluid points • cube position in the global grid. |
| cubeConnectivity | 4 x 26 x sizeof(int) | Stores connectivity data for cube-to-cube communication: neighboring cube IDs • loading rank of the neighbors • MPI send tags • MPI receive tags. |
Timestep
| Variable | Size | Description |
|---|---|---|
| mainVariables | 7 x (Nx+2) x (Ny+2) x (Nz+2) x sizeof(double) | Stores the seven main variables p, u, v, w, uimbc, vimbc, wimbc at each grid point. |
| fldPtCount | sizeof(int) | Stores the total number of cube points with at least one fluid collocation point (used in time advancement). |
| temporaryVariables | 6 x MAX(fldPtCount) x sizeof(double) | Stores the support variables utmp, vtmp, wtmp, newu, newv, neww needed for time advancement. |
| fldPtIndex | MAX(fldPtCount) x sizeof(double*) | Stores pointers into mainVariables for the points with at least one fluid collocation point. |
| rbPtCount | 2 x sizeof(int) | Stores the number of red and black cube points where pressure correction is applied. |
| rbPtIndex | 2 x MAX(rbPtCount) x sizeof(double*) | Stores pointers into mainVariables where pressure correction is applied. |
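As a hedged illustration of how the red-black pointer arrays are meant to be used (C; the loop body and names are invented, the real iteration lives in solver.cpl and its modules), a sweep simply walks the two precomputed arrays instead of testing every grid point:

```c
// Sketch of a red-black sweep: all red points are updated first, then all
// black points, each colour through its own precomputed pointer array.
void pressureSweep(double **rbPtIndex[2], const int rbPtCount[2])
{
  for (int colour = 0; colour < 2; colour++)     // 0 = red, 1 = black
    for (int i = 0; i < rbPtCount[colour]; i++) {
      double *p = rbPtIndex[colour][i];          // points into mainVariables
      // ... apply the pressure correction at *p ...
      (void)p;
    }
}
```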
Boundary Conditions
| Variable | Size | Description |
|---|---|---|
| cubeBCsNumber | sizeof(int) | Stores the number of boundary conditions (BCs) applied within the cube, excluding BCs of type p and w (i.e. periodic and no-slip). |
| cubeBCinfo | MAX(cubeBCsNumber) x 4 x sizeof(int) | Stores general information about each BC: BC ID • index of the plane where the BC is applied • number of points where the BC is applied inside the cube • signed normal of the plane. |
| cubesBCsMaps | MAX(cubeBCsNumber) x sizeof(double*) | Stores pointers to boolean maps. For a BC with normal along x, the pointer references an array of size (Ny+2) x (Nz+2), where Y indicates that the BC is applied at that point; the total number of Y entries is also stored in cubeBCinfo. |
Grid
| Variable | Size | Description |
|---|---|---|
| xCoords | (Nx+2) x sizeof(double) | Stores the coordinates of the cube's velocity collocation points in the x direction. Similar arrays yCoords and zCoords exist for the y and z directions. |
| xDiffs | (Nx+2) x 6 x sizeof(double) | Stores the first- and second-derivative coefficients in the x direction. Similar arrays yDiffs and zDiffs exist for the y and z directions. |
Statistics
| Variable | Size | Description |
|---|---|---|
| statsArray | nStats x Nx x Ny x Nz x sizeof(double) | Stores the cube statistics. |