Overview
The numerical method
The solver uses the IB method presented by Luchini et al. (2025), whose simplicity and performance are key to achieving the computational speed AMPHIBIOUS is capable of. The spatial discretization is based on second-order finite differences. The equations are written in primitive variables; time integration is fully explicit and follows a fractional-step approach; the pressure correction is computed iteratively. Temporal discretization uses a three-stage Runge-Kutta scheme (switching to alternative schemes is straightforward). The method is second-order accurate in space, no matter how complex the boundary, and the temporal accuracy is preserved. The algorithm is simple, fast and extremely robust, thanks to the favorable stability properties inherited from the IB correction, which is applied implicitly both in space and time.
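For orientation, a generic explicit fractional-step update of this class can be written as follows; this is only a schematic outline, not the exact scheme of Luchini et al. (2025):

$$
\begin{aligned}
\frac{\mathbf{u}^{*}-\mathbf{u}^{n}}{\Delta t} &= -(\mathbf{u}^{n}\cdot\nabla)\,\mathbf{u}^{n} + \nu\nabla^{2}\mathbf{u}^{n} - \nabla p^{n},\\
\nabla^{2}\phi &= \frac{\nabla\cdot\mathbf{u}^{*}}{\Delta t} \quad\text{(pressure correction, solved iteratively)},\\
\mathbf{u}^{n+1} &= \mathbf{u}^{*}-\Delta t\,\nabla\phi, \qquad p^{n+1}=p^{n}+\phi,
\end{aligned}
$$

with the Runge-Kutta stages obtained by repeating an update of this form with stage-dependent coefficients.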
The key property of the IB method is that only the central point of the Laplacian stencil is corrected, which brings substantial code simplifications. The grid is Cartesian; the three velocity components are defined at the cell faces, and the pressure is defined at the cell center, as shown in the figure below. Arbitrary grid spacing is allowed.
Figure: a solid body (gray background) with the staggered Cartesian grid. The collocation points for the velocity components in the x, y and z directions are drawn in green, red and blue. Dull colors indicate the fully internal/external points, and vivid colors indicate points where the IB correction is applied.
Computational domain and boundary conditions
The computational domain (i.e. where the Navier--Stokes equations are solved) is enclosed in a parallelepiped of dimensions Lx x Ly x Lz, referred to as the bounding box.
Its dimensions are either assigned in settings.in, whenever a simple geometry is described analytically (as in the channel flow example), or computed automatically from the geometry itself, whenever STL files are used to describe the computational domain (as in the nose flow example).
Boundary conditions (BC) are applied on planar surfaces which must be parallel to the faces of the bounding box. The set of implemented BC is still quite limited; however, its extension is straightforward and requires little more than defining the new BC itself. The currently supported types of BC, identified by a letter in the input file, include:
- type w: solid wall
- type m: solid wall with uniform or travelling-wave-like motion in the wall-parallel directions
- type p: periodic (must be applied on the whole face of the bounding box)
- type i or o: total pressure inlet or outlet (inlet or outlet planes can be in the domain interior)
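As an illustration of how little is needed to add a new BC, the sketch below dispatches on the single-letter codes listed above; the routine names (and the printouts standing in for them) are hypothetical, not the actual AMPHIBIOUS functions.

```c
#include <stdio.h>

/* Illustrative sketch only: the branches are hypothetical stand-ins for the
   actual boundary-condition routines. Extending the set of BC amounts to
   adding one case to the dispatch plus the routine implementing it. */
static void apply_bc(char type) {
    switch (type) {
        case 'w': printf("solid wall\n");                               break;
        case 'm': printf("moving wall (uniform or travelling wave)\n"); break;
        case 'p': printf("periodic\n");                                 break;
        case 'i': printf("total-pressure inlet\n");                     break;
        case 'o': printf("total-pressure outlet\n");                    break;
        default : printf("unknown BC '%c': add a new case here\n", type);
    }
}

int main(void) {
    const char faces[6] = {'p', 'p', 'p', 'p', 'w', 'w'};  /* e.g. a plane channel */
    for (int i = 0; i < 6; i++) apply_bc(faces[i]);
    return 0;
}
```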
Parallelism and domain decomposition
Every program in AMPHIBIOUS can run in parallel with MPI, with maximum flexibility in terms of MPI ranks and domain decomposition in every direction. GPU computing is only supported by solver, where most of the computing effort is spent.
The entire domain is discretized by preproc on a grid made of Nx x Ny x Nz mesh points. It is also decomposed into nDecompCubes = nCubesX x nCubesY x nCubesZ cubic subdomains. Each cube possesses halo points used to exchange data with neighboring cubes. The strategy of decomposing the domain into cubes is inspired by Nakanashi & Kim (2004) and Jansson et al. (2019), and is useful for large cases involving complex geometries.
preproc is constrained to build cubes containing the same number of grid points. This implies that cubes with points outside the bounding box may be generated. These extra points are considered as solid points, and their IB coefficients are computed accordingly; as for every solid point, they are not considered in the calculations performed by solver.
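Below is a minimal sketch of this decomposition arithmetic; the rounding-up (padding) of the per-cube point count is an assumption consistent with the statement that cubes may contain points outside the bounding box, and the values are hypothetical.

```c
#include <stdio.h>

/* Sketch: equal-sized cubes in one direction, with padding when the number of
   mesh points is not divisible by the number of cubes (assumed behaviour). */
int main(void) {
    int Nx = 150, nCubesX = 4;                       /* hypothetical values      */
    int ptsPerCubeX = (Nx + nCubesX - 1) / nCubesX;  /* same count in every cube */
    int padded = ptsPerCubeX * nCubesX - Nx;         /* points outside the box,  */
                                                     /* treated as solid points  */
    printf("points per cube in x: %d, padded points: %d\n", ptsPerCubeX, padded);
    return 0;
}
```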
preproc writes its output to several cube-specific files in the subfolder cubes/, to be read later by solver.
Each file, identified by a numeric ID corresponding to the cube and required to establish connectivity with other cubes, starts with an ASCII header. After the header, preproc writes (in binary) the IB coefficients of the cells of each cube, discriminating whether a point is internal (i.e. in the fluid) or external (in the solid). If a cube has zero internal points, it gets pruned from the list of cubes, the corresponding cube-specific file is not written, and the global connectivity map is updated accordingly.
The actual number of cube files nTotCubes written by preproc may therefore be less than nDecompCubes, and varies with the input parameters nCubesX, nCubesY, nCubesZ.
At present, these parameters also determine the number of MPI ranks used by the preprocessor (this constraint will be lifted soon).
Therefore:
- nDecompCubes is the total number of cubes after the initial domain decomposition;
- nTotCubes <= nDecompCubes is the total number of cubes left after the pruning process;
- nCubes <= nTotCubes is the number of cubes loaded by one MPI rank.
Lastly, preproc writes the file general-info.dat, containing among other things the value of nTotCubes and the list of their IDs.
Figure: plane channel flow example. The domain is decomposed in nCubesX=nCubesY=4 cubes along the x and y directions and nCubesZ=1 in the z direction. No cubes are pruned, as the fluid domain and the bounding box are coincident: nDecompCubes=nTotCubes=16. In execution, different numbers of MPI ranks can be used; three examples are shown, with cubes loaded by the ranks accordingly. Numbers in the cubes are their IDs. Colors highlight the cubes loaded by each rank.
Figure: nose flow example. The domain is decomposed in nCubesX=nCubesY=4 cubes along the x and y directions and nCubesZ=1 in the z direction. Then, the empty cubes 1, 4, 7 and 8 are pruned: nDecompCubes=16 but nTotCubes=12. In execution, different numbers of MPI ranks can be used; three cases are shown, with cubes loaded by the ranks accordingly. Numbers in the cubes are their IDs. Colors highlight the cubes loaded by each rank.
The two figures above provide examples where the optimal number of cubes is chosen following a different rationale. In the former example the geometry is simple, the bounding box contains internal points only, and the number of cubes should be set with the maximum number of MPI ranks in mind. For the nose geometry, instead, the number of cubes also affects the fraction of the computational volume that gets pruned: in the example, the four empty cubes 1, 4, 7 and 8 are removed, so that nDecompCubes=16 but nTotCubes=12.
The solver is built such that each MPI rank can be assigned one or more cubes. Therefore, the number of possible MPI ranks ranges from 1 to nTotCubes. When using just one MPI rank, all non-pruned cubes are loaded by that rank; when nTotCubes ranks are used, each rank loads exactly one cube.
Since each cube may perform up to 26 communications, decomposing the domain into too many cubes may lead to a substantial overhead. A large number of cubes is therefore recommended only whenever a load-balancing advantage is achieved via pruning.
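The factor of 26 simply counts all the neighbors of a cube that share a face, an edge or a corner; the minimal check below (purely illustrative) confirms it.

```c
#include <stdio.h>

/* Counts the possible communication partners of one cube: all neighbors in a
   3x3x3 stencil except the cube itself, i.e. 3^3 - 1 = 26 (faces, edges, corners). */
int main(void) {
    int n = 0;
    for (int i = -1; i <= 1; i++)
        for (int j = -1; j <= 1; j++)
            for (int k = -1; k <= 1; k++)
                if (i != 0 || j != 0 || k != 0) n++;
    printf("max communications per cube: %d\n", n);
    return 0;
}
```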
In any case, since solver does not alter the files written by preproc, it is possible to experiment in order to find the optimal number of ranks for solver.
Running on GPUs
solver supports both CPU and GPU execution. GPU programming is implemented using OpenACC directives, which appear in the source code as special comment lines.
When running on GPUs, all data structures are first copied to the GPU, where they then permanently reside. The slow CPU--GPU communications are kept to a minimum: during a time step, such communications only involve single variables (for example the flow rate).
When running on multiple GPUs, GPU-aware MPI communications are used, and explicit CPU--GPU copying is once again avoided. Since the performance degradation caused by multiple MPI ranks working on the same GPU can be substantial (up to a 2x slowdown was measured), in the multi-GPU case the number of MPI ranks must equal the number of GPUs, otherwise a warning is issued. As load balancing can be performed by properly assigning cubes to MPI ranks, there is no need to balance GPU load if MPI ranks are balanced.
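The snippet below shows the same pattern in generic OpenACC form: data resident on the GPU across the time loop, with only a scalar (here a flow-rate-like quantity) returned to the CPU at each step. It is a self-contained illustration, not code taken from solver; it can be compiled, for instance, with nvc -acc.

```c
#include <stdio.h>

#define N 1000000
static double u[N];

/* Generic OpenACC sketch (not the solver's code): the array u is copied to the
   GPU once, stays resident for the whole time loop, and only the scalar
   flowrate travels back to the CPU at every step. */
int main(void) {
    for (int i = 0; i < N; i++) u[i] = 1.0;
    #pragma acc data copy(u)                     /* resident on the GPU from here on */
    {
        for (int step = 0; step < 10; step++) {
            double flowrate = 0.0;
            #pragma acc parallel loop reduction(+:flowrate)
            for (int i = 0; i < N; i++) {
                u[i] += 1e-3;                    /* stand-in for a time (sub)step */
                flowrate += u[i];
            }
            printf("step %d, mean u = %g\n", step, flowrate / N);  /* scalar only */
        }
    }
    return 0;
}
```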
To enable GPU execution, the macro USEGPU must be defined in the file headers.cpl. At compile time, the Makefile checks for the macro and compiles for the intended target.
Also, the Makefile should have the variable CC set to the GPU compiler (e.g. CC=nvc), and the variable CFLAGS properly set.
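For instance, when targeting NVIDIA GPUs with the NVIDIA HPC compiler, one possible (purely illustrative) choice is:
CC=nvc
CFLAGS=-acc -gpu=cc80 -Minfo=accel
where the exact flags depend on the compiler version and on the target GPU architecture.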
Input files
The information required to set up a calculation consists of two main items:
- the file settings.in, containing information for both preproc and solver (this is the only file the user interacts with during the entire simulation process, and it can also be used to control post-processing);
- the geometrical description of the problem, which can be provided analytically or through a set of STL files (which, for simple geometries, can optionally be created by hand). For the simple and popular case of the plane channel, the utility createChannelSTL is available to generate the 6 STL planes delimiting the computational domain.
The input file settings.in is organized in sections, which are only read when needed.
Sections can be placed in any order inside the settings file. Variables inside a section, instead, must be in the same order in which they are read (READ FROM) by the program. For the available code, the order is the one shown in the two examples found in the repository.
Physical quantities are expressed in dimensional SI units. The sections are as follows (a schematic example is given after this list):
- PREPROC: sets the variables Nx, Ny, Nz (number of discretization points in each direction) and nCubesX, nCubesY, nCubesZ (number of cubes in each direction);
- GRID_NON_UNIFORM: whenever the Cartesian grid has non-uniform spacing, allows choosing the law of variation of the grid step, independently for each of the three directions;
- GEOMETRY_ANALYTIC: contains the description of a geometry given in analytic form (as in the channel flow example), in terms of box size and boundary conditions; alternatively, the section GEOMETRY_STL contains the list of STL file names which define the computational domain, together with the type of boundary condition, possibly complemented by the numerical value to be assigned on the boundary;
- SOLVER: sets fluid properties (e.g. the fluid viscosity nu), as well as parameters for the temporal discretization (e.g. the time step deltat) and for the management of the simulation (e.g. the maximum simulation time Tmax); the time step can be fixed or computed from a CFL condition;
- INPUT_OUTPUT: manages the I/O, and controls whether a restart file is saved and/or used, as well as the creation of a database of flow fields. A set of first and second moments for velocity and pressure can also be computed. Since the full statistics can lead to large file sizes in three-dimensional cases, one can select a subset of statistics by filling the variable statsNames with the required fields: e.g. statsNames=um,p2 only accumulates the mean of the u velocity component and the variance of pressure fluctuations;
- CHANNEL_FLOW: contains information specific to the channel flow case, for example the enforced value of the flow rate / pressure gradient in the three directions;
- VTK: not read at runtime, but used by the utility createVTK.cpl, which converts statistics and/or fields into the VTK format to enable visualization.
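As a purely schematic illustration of the layout, a PREPROC section could look like the lines below. The authoritative syntax, section names and variable order are those of the settings.in files shipped with the two examples; the values and the exact formatting shown here are hypothetical.
PREPROC
Nx=256 Ny=192 Nz=128
nCubesX=4 nCubesY=4 nCubesZ=1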
Building the program
The program contains compile-time customization options which can be selected by the user. These consist of:
- several macros, which can be set in the file headers.cpl by uncommenting the related #define lines;
- a flow-specific file, placed in the modules-solver/ folder, to be used whenever flow-specific operations in solver are required. To do so, the user must USE it in solver. In the two examples, the file specifics-channel.cpl manages the screen output at runtime and the forcing term, whereas specifics-nose.cpl only manages the screen output.
The following macros can be set:
| Name | Affected program | Description |
|---|---|---|
| GRID_NON_UNIFORM | preproc | Enables a non-uniform Cartesian grid in the three directions (default is uniform in the three directions) |
| GEOMETRY_ANALYTIC | preproc | Allows preprocessing of analytically-described geometries. Mutually exclusive with GEOMETRY_STL |
| GEOMETRY_STL | preproc, solver | Allows preprocessing of STL-described geometries. Mutually exclusive with GEOMETRY_ANALYTIC |
| USEGPU | solver | Enables GPU execution; requires a C compiler with support for OpenACC directives |
| STATS_SIZE_DOUBLE | solver, createVTK | Computes statistics in double precision (default is single precision) |
Flow-specific files can be used to customize solver; as noted above, the user must USE the chosen flow-specific file in solver.
When needed, flow-specific input can be assigned in a dedicated section of the settings.in file.
Any ad hoc routine is INLINEd; by doing so, solver variables can be accessed even when not yet defined (they must be defined by the time the function is called). The CPL deferred assignment operator == must be used. An example can be found in specifics-channel.cpl, where only a body force to drive the flow and a specialized video output are implemented.
Running a simulation
The two-step process of executing a simulation is straightforward, and involves the sequential run of preproc and solver, after the required information has been set up in the settings.in file, including the preparation of the necessary STL files if the geometry is so described.
We provide below the workflow for the two examples currently included in the repository, which are representative of these two categories. For each tutorial, a file named reference-output.dat is available: running the cases without modifications is expected to replicate its content.
The channel flow example can be run with either analytical or STL geometry input. Slightly different screen output is expected between the two cases, because of the minor grid differences between them (see later). The current reference-output.dat is computed for the analytical geometry.
Each program (e.g. preproc) is compiled as:
make preproc
The default compiler is gcc for the CPU target and nvc for the GPU target. The compiler, as well as other compilation options, can be changed by editing the Makefile.
During execution, data are read from and written to files residing in a user-defined folder, the case folder. Each program is executed as:
mpirun -n <N> <program> <caseDir>
where program is either ./preproc or ./solver, and caseDir is the case folder, where the settings.in file is located. All files and folders are relative to caseDir: preproc writes its output files in the sub-folder <caseDir>/cubes/, which is created anew when missing.
When the geometry is described via STL files, preproc expects to find them in <caseDir>/geometry/.
Channel flow example
When cloning the repository, the solver is set up to run the channel example with an analytical description of the geometry. Compile the preprocessor with
make preproc
and run it with:
mpirun -n <nDecompCubes> ./preproc examples/channel
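With the decomposition used in the channel example (nCubesX=nCubesY=4 and nCubesZ=1, hence nDecompCubes=16), this becomes, for instance:
mpirun -n 16 ./preproc examples/channel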
Then, the solver is compiled with
make solver
and run as:
mpirun -n <M> ./solver examples/channel
where M is the number of MPI ranks, and can be any integer between 1 and nTotCubes, whose value can be read from preproc-log.dat.
The output of solver must be identical to examples/channel/reference-output.dat.
To run the same test with the geometry described by STL, 6 STL files, each representing a planar surface made of two triangles only, are needed to define the bounding box. These files can be written by hand; for a convenient alternative, the little program createChannelSTL.cpl in the folder utilities can be used to set the lengths of the box sides and to write the required 6 STL files in the case folder.
Compile the utility with:
make ChannelSTL
and execute it as:
./utilities/createChannelSTL examples/channel
to write the required STL files in the geometry/ subfolder of examples/channel.
With the STL files in place, the only thing left to do before running solver is switching from #define GEOMETRY_ANALYTIC to #define GEOMETRY_STL in the file headers.cpl.
A note on (a)dimensionalization
The code uses variables with physical dimensions, as it is not possible to define dimensionless parameters of general use. However, this may be inconvenient when dealing with simple flows like channel or pipe flow, where a dimensionless description is easier and more compact, and reasoning in terms of a Reynolds number is more natural.
One can run the channel flow simulation via the settings.in file in (at least) two ways: constant pressure gradient (CPG) and constant flow rate (CFR). The latter case is an extension of the former. They are described at length in this paper.
- CPG. The natural choice of Reynolds number is the friction Reynolds number Re_tau = u_tau delta / nu, built with the friction velocity u_tau, the half gap width delta and the fluid viscosity nu. One starts with a choice of the numerical value of Re_tau; deciding the working fluid via the value of nu sets the value of the product u_tau delta, so that picking a value for delta sets the value of u_tau. In settings.in, the input variable Lz must be set to the value 2 delta. The value of u_tau, instead, leads to the value of the input variable headx, given by u_tau^2 / delta (a worked numerical example is given after this list).
- CFR (currently not implemented). The natural choice of Reynolds number is the bulk Reynolds number Re_b = U_b delta / nu, built with the bulk velocity U_b, the half gap width delta and the fluid viscosity nu. One starts with a choice of the numerical value of Re_b; deciding the working fluid via the value of nu sets the value of the product U_b delta, so that picking a value for delta sets the value of U_b. In settings.in, the input variable Lz must be set to the value 2 delta. The value of U_b, instead, leads directly to the value of the input variable meanu.
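A short worked CPG example, using the relations reconstructed above; it assumes in particular that headx is the kinematic streamwise pressure gradient u_tau^2/delta, and the numerical values (a water-like viscosity and a 1 m half gap) are purely illustrative.

```c
#include <stdio.h>

/* Worked CPG setup under the assumptions stated in the text above. */
int main(void) {
    double Retau = 180.0;               /* chosen friction Reynolds number   */
    double nu    = 1.0e-6;              /* m^2/s, working-fluid choice       */
    double delta = 1.0;                 /* m, half gap width                 */
    double utau  = Retau * nu / delta;  /* m/s, friction velocity            */
    double Lz    = 2.0 * delta;         /* input variable Lz                 */
    double headx = utau * utau / delta; /* input variable headx (assumed)    */
    printf("u_tau = %g m/s, Lz = %g m, headx = %g m/s^2\n", utau, Lz, headx);
    return 0;
}
```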
Nose flow example
This case requires #define GEOMETRY_STL; the necessary STL files are provided and already in place.
To execute:
- in header.cpl: switch to #define GEOMETRY_STL;
- in header.cpl: remove #define GRID_NON_UNIFORM;
- in solver.cpl: at approximately line 30, switch to USE nose-specifics.cpl.
Now, compile the preprocessor with
make preproc
and run it with
mpirun -n <nDecompCubes> ./preproc examples/nose
Then, the solver is compiled with
make solver
and run as
mpirun -n <M> ./solver examples/nose
where M is the number of MPI ranks, with 1 <= M <= nTotCubes. The value of nTotCubes can be read from preproc-log.dat. The output, in CPU serial execution, should be identical to examples/nose/reference-output.dat.
Working directory and datafile structure
preproc and solver read and write files only inside the case directory.
Within it, files are organized in subfolders according to scope. The subfolders have fixed names. The names of the data files, instead, are composed as follows:
- a fixed part, which informs of the content of the file (e.g. restart);
- a variable part, unique for each cube, based on the cube ID and the total number of cubes nDecompCubes;
- the extension of the file (e.g. .vtk or .bin).
For example, when the program is executed serially, the generated restart file is called restart.1of1.bin, as only one cube is generated.
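The sketch below builds such a name from a cube ID and the total number of cubes; the IDofTotal pattern is inferred from the restart.1of1.bin example, and the numerical values are hypothetical.

```c
#include <stdio.h>

/* Builds a cube-specific file name: fixed part + "IDofTotal" + extension. */
int main(void) {
    int cubeId = 3, nDecompCubes = 16;   /* hypothetical values */
    char name[64];
    snprintf(name, sizeof name, "restart.%dof%d.bin", cubeId, nDecompCubes);
    printf("%s\n", name);                /* prints restart.3of16.bin */
    return 0;
}
```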
After the execution of preproc, solver and the VTK generation utility createVTK (which creates a VTK version of the computed statistics and of the database of flow fields), the case directory will show the following structure:
Figure: complete structure of a case folder.
The cube-specific files are made of two parts:
- a readable ASCII header, whose length of 8 kB is set in header.cpl, containing data and metadata of the cube, e.g. the cube connectivity for the cube-data files, or the snapshot timestamp in the restart files;
- a binary chunk, containing a dump of the actual data being saved (e.g. IB coefficients, accumulated statistics, restart field, snapshots).
Binary data are written to file as they are represented in memory at runtime, i.e. with SoA (Structure of Arrays) ordering in row-major order (as in the C language: arrays are indexed with the most rapidly changing index last). The whole data structure is written to disk without discriminating between internal and external points. Not all data structures share the same size: the IB coefficients, restart fields and snapshots are defined also at the halo points of the cube, while the statistics are not.
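A generic illustration of this storage order (not the actual AMPHIBIOUS I/O routine): one scalar field stored contiguously in row-major order, last index fastest, and dumped to disk as a single binary chunk, internal and external points alike. The array sizes are arbitrary.

```c
#include <stdio.h>
#include <stdlib.h>

/* Row-major SoA dump: the field u is one contiguous array; the x index i is
   the last, most rapidly changing one, and the raw memory image is written
   to disk in a single fwrite. */
int main(void) {
    const int nx = 4, ny = 3, nz = 2;    /* illustrative cube size (incl. halos) */
    double *u = malloc(sizeof(double) * nx * ny * nz);
    if (!u) return 1;
    for (int k = 0; k < nz; k++)
        for (int j = 0; j < ny; j++)
            for (int i = 0; i < nx; i++)
                u[(k * ny + j) * nx + i] = 0.0;   /* u[k][j][i], i fastest */
    FILE *f = fopen("u.bin", "wb");
    if (!f) { free(u); return 1; }
    fwrite(u, sizeof(double), (size_t)(nx * ny * nz), f);  /* raw memory dump */
    fclose(f);
    free(u);
    return 0;
}
```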
Performance
Figure: strong scaling on CPUs and GPUs (preliminary data).
Performance analysis is still preliminary, and certain parts of the code require substantial optimization. Nevertheless, the starting point in terms of raw computational performance is encouraging. The figure above illustrates the measured time, normalized per point and time (sub)step, for CPU and GPU execution on a problem of size 1024^3 (strong scaling). The output of the code in the two cases is identical to floating-point precision, and verified not to change with the number of MPI ranks. The GPU test hardware is the Leonardo Booster partition at CINECA. Each of the 3456 nodes has a single-socket Intel Xeon Platinum 8358 with 32 cores, and four NVIDIA A100 GPUs. Each node has 512 GB of RAM, and inter-node communications are serviced by NVIDIA Mellanox InfiniBand at 200 Gbps with a DragonFly+ topology. The GPUs in a node are interconnected through NVLink 3.0, allowing direct peer-to-peer memory transfers.
The CPU execution has a scalar speed of 55 x 10^-9 seconds per point/step on one core; efficiency is gradually affected until the cores of the first node get fully used. After that, linear scaling is observed up to the maximum tested number of 1024 cores. The GPU execution has a single-GPU speed of 7.5 x 10^-10 seconds per point/step: a case of size 1024^3 advances by one time step in 0.75 seconds. The multi-GPU scaling is approximately linear up to 32 GPUs, and then gradually departs from linearity.