
Overview

An overview of AMPHIBIOUS and its workflow is provided, and the two tutorial examples are described.

The numerical method

The DNS solver uses the IB method presented by Luchini et al. (2025); its simplicity and performance are key to the computational speed AMPHIBIOUS is capable of. The spatial discretization is based on second-order finite differences. The equations are written in primitive variables; time integration is fully explicit (beware of low-Reynolds-number simulations!) and follows a fractional-step approach; the pressure correction is computed iteratively. Temporal discretization uses a three-stage Runge-Kutta scheme (switching to alternative schemes is straightforward). The method is second-order accurate in space, regardless of the complexity of the boundary, and the temporal accuracy is preserved. The algorithm is simple, fast and extremely robust, thanks to the favorable stability properties inherited from the IB correction, which is applied implicitly both in space and time.

The key property of the IB method is that only the central point of the Laplacian stencil is corrected, which brings substantial code simplifications. The grid is Cartesian; the three velocity components are defined at the cell faces, and the pressure is defined at the cell center, as shown in the figure below. Arbitrary grid spacing is allowed.

Figure: a solid body (gray background) with the staggered Cartesian grid. The collocation points for the velocity components in the x, y and z directions are drawn in green, red and blue. Dull colors indicate the fully internal/external points, and vivid colors indicate points where the IB correction is applied.

Computational domain and boundary conditions

The computational domain (i.e. where the Navier--Stokes equations are solved) is enclosed in a parallelepiped referred to as the bounding box. Its dimensions are either assigned via the input file settings.in, whenever a simple geometry is described analytically (as in the channel flow example), or computed automatically from the geometry itself, whenever STL files are used to describe the computational domain (as in the nose flow example).

Boundary conditions (BC) are applied on the body or on planar surfaces which must be parallel to the faces of the bounding box. The set of implemented BC is minimal at the moment; however, extending it is straightforward and requires little more than defining the new BC itself. The currently supported BC include the solid wall (possibly with a travelling-wave-like motion in the wall-parallel directions), periodic conditions, and total pressure at inlet or outlet. More details are provided in the dedicated Section.

Parallelism, domain decomposition and cube pruning

All programs in AMPHIBIOUS run in parallel with MPI, with maximum flexibility in terms of number of ranks and domain decomposition in every direction. GPU computing is available only for solver, the only part of the code where a large computational effort is spent.

The entire domain is discretized by preproc on a grid made of Nx x Ny x Nz mesh points. It is also decomposed into nDecompCubes = nCubesX x nCubesY x nCubesZ cubic subdomains. Each cube possesses halo points used to exchange data with neighboring cubes. Decomposing the domain into cubes is useful for large cases involving complex geometries; the procedure is inspired by Nakahashi & Kim (2004) and Jansson et al. (2019).

preproc is constrained to build cubes containing the same number of grid points. Therefore, cubes partially outside the bounding box may be generated. Their outer points are considered solid points, and their IB coefficients are computed accordingly; like every solid point, they are ignored in the calculations performed by solver.

A global file general-info.dat and several cube-specific files are written by preproc in the subfolder cubes/. The file general-info.dat contains the value of nTotCubes, the number of active cubes, as well as the list of their IDs. The number nTotCubes of cube files written by preproc may therefore be smaller than the total number nDecompCubes, and depends on the input parameters nCubesX, nCubesY, nCubesZ. These parameters also determine the MPI ranks of the preprocessor. Multiple cubes may then be loaded at once by each MPI rank of the solver. Therefore (a worked example follows the list):

  • nDecompCubes is the total number of cubes after the initial domain decomposition;
  • nTotCubes <= nDecompCubes is the total number of cubes left after the pruning process;
  • nCubes <= nTotCubes is the total number of cubes loaded by one MPI rank.
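For example, with nCubesX=nCubesY=nCubesZ=5 as in the nose tutorial below, nDecompCubes=125; pruning removes the 34 empty cubes, leaving nTotCubes=91; if solver is then run with, say, 7 MPI ranks, each rank loads nCubes=13 cubes.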

Each cube-specific file, identified by its ID, starts with an ASCII header, followed by the immersed-boundary coefficients of the cells of the cube, and by information required to establish connectivity with other cubes. The IB coefficients allow discriminating whether a point is internal (i.e. in the fluid) or external (in the solid). If a cube has zero internal points, it gets pruned from the list of cubes, the corresponding cube-specific file is not written, and the global connectivity map is updated accordingly.

The two figures below provide examples in which the value of nDecompCubes is chosen following different rationales.

Figure: plane channel flow example.
Top: the pruning process. The bounding box (left) is decomposed into nCubesX=nCubesY=nCubesZ=5 cubes (center); no cubes are pruned (right), as the fluid domain and the bounding box coincide: nDecompCubes=nTotCubes=125. Bottom: cube-rank assignment; colors identify the cubes loaded by each rank. At run time, different numbers of MPI ranks can be used, for example 1 (left), 3 (center) or 5 (right).

Figure: nose flow example.
Top: the pruning process. The bounding box (left) is decomposed into nCubesX=nCubesY=nCubesZ=5 cubes (center); empty cubes are pruned (right), so that nDecompCubes=125 but nTotCubes=91. Bottom: cube-rank assignment; colors identify the cubes loaded by each rank. At run time, different numbers of MPI ranks can be used, for example 1 (left), 3 (center) or 5 (right).

The solver is built such that each MPI rank can be assigned one or more cubes. Therefore, the number of possible MPI ranks ranges from 1 to nTotCubes. When using just one MPI rank, all non-pruned cubes are loaded by that rank; when nTotCubes ranks are used, each rank loads one cube. Since each cube may perform up to 26 communications, an excessive number of cubes may lead to substantial overhead. A large number of cubes is therefore recommended only when a load-balancing advantage or significant memory savings are achieved via pruning. In any case, since solver does not alter the files written by preproc, it is possible to experiment to find the optimal number of ranks for solver.
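For example, in the nose case described below (nTotCubes=91), solver can be launched on the same preprocessed data with 1, 7 or 91 MPI ranks without rerunning preproc; the last option assigns exactly one cube per rank, while with 7 ranks each rank loads 13 cubes.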

In the channel example, the geometry is simple, the bounding box contains internal points only, and the number of cubes should be set with the maximum number of MPI ranks in mind. The domain is decomposed into nCubesX=nCubesY=nCubesZ=5 cubes. No cubes are pruned, as the fluid domain and the bounding box coincide: nDecompCubes=nTotCubes=125. For the nose example, the same domain decomposition is employed; empty cubes are then pruned: nDecompCubes=125 but nTotCubes=91. At run time different numbers of MPI ranks can be used; three examples are reported in the figures, with cubes assigned to ranks accordingly.

Running on GPUs

The compute-intensive solver supports both CPUs and GPUs. GPU programming is implemented with OpenACC directives, which appear in the source code as special comment lines. When running on GPUs, all data structures are copied to the GPU at startup and permanently reside there. The slow CPU-GPU communications are kept to a minimum: during a time step, they only involve single global variables (e.g. the flow rate). When running on multiple GPUs, GPU-aware MPI communications are used, and explicit CPU-GPU copies are again avoided. Since the performance degradation caused by multiple MPI ranks working on the same GPU can be substantial (up to a 2x slowdown was measured), in the multi-GPU case the number of MPI ranks must equal the number of GPUs (a warning is issued otherwise). Load balancing is performed by properly assigning cubes to MPI ranks; as long as the MPI ranks are balanced, the GPUs are balanced too.

To enable GPU execution, the macro TARGET_IS_GPU must be defined in the file headers.cpl. At compile time, the Makefile uses the macro TARGET_IS_GPU to compile for the intended target.
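For instance (a minimal sketch; the exact context of this line inside headers.cpl is not shown here), compiling solver for the GPU target amounts to making sure that headers.cpl contains the uncommented line

#define TARGET_IS_GPU

and then rebuilding with make solver.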

How to set up a simulation

The information required for a simulation consists of:

  • a properly filled settings.in, which is the only file required to configure preproc, solver and postproc;
  • the geometrical description of the problem, that can be provided analytically, or through a set of STL files (which, for simple geometries, can be optionally created by hand). For the simple and popular case of the plane channel, the utility utilities/createChannelSTL is available to generate the 6 STL planes delimiting the computational box.

The input file settings.in is organized in independent sections, which are read on an as-needed basis. Sections can appear in any order inside the settings file. Variables inside a section, instead, must follow the same order in which they are read by the program.

The sections are as follows (a minimal skeleton of the file is sketched after the list):

  • PREPROC:
    sets the variables Nx, Ny, Nz (number of discretization points in each direction) and nCubesX, nCubesY, nCubesZ (number of cubes in each direction);

  • NONUNIFORM_GRID:
    whenever the Cartesian grid has non-uniform spacing, sets the law of variation of the grid step, independently for each of the three directions;

  • ANALYTIC_GEOMETRY:
    contains the description of a geometry given in analytic form (as in the channel flow example), in terms of box size, number of boundary planes and the corresponding boundary conditions. Alternatively, the section STL_GEOMETRY contains the number and names of the STL files which define the computational domain, each accompanied by its boundary condition;

  • SOLVER:
    sets fluid properties (e.g. the fluid viscosity nu), as well as parameters for the temporal discretization (e.g. the time step dt) and for the management of the simulation (e.g. the maximum simulation time Tmax); the time step can be constant or computed from a CFL condition. The variable nItPressureCorrection sets the number of iterations for the solution of the pressure equation. Physical quantities are expressed in dimensional SI units.

  • INPUT_OUTPUT:
    controls whether and how a restart file is saved and used, as well as the creation of a database of flow fields. A set of first and second moments for velocity and pressure can also be computed. Since the full set of statistics can lead to large files, a subset can be selected by filling the variable statsNames with the required fields: e.g. statsNames=um,p2 only accumulates the mean of the u velocity component and the variance of the pressure fluctuations;

  • POSTPROC:
    converts the binary data accumulated during the simulation into three-dimensional VTK files for the flow field(s) (velocity and pressure) and the statistics, which can be saved in double or single precision to save disk space. Averaging over the homogeneous directions (when present) is also possible, to obtain statistics in an ASCII file.

  • a flow-specific section, e.g. CHANNEL_FLOW:
    contains information for the specific flow case; for the channel flow, for example, it contains the enforced value of the flow rate / pressure gradient in the three Cartesian directions.
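The following skeleton illustrates how such a file may be organized. Section and variable names are those described above; the values, the ordering of the variables and the exact section-delimiter syntax are purely illustrative and may differ from the actual file:

PREPROC
  Nx=100 Ny=100 Nz=100
  nCubesX=5 nCubesY=5 nCubesZ=5
SOLVER
  nu=1.5e-5
  dt=1.0e-4
  Tmax=1.0
  nItPressureCorrection=10
INPUT_OUTPUT
  statsNames=um,p2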

Boundary conditions

The boundary conditions are specified in the geometry section of settings.in. Currently, only a few types of boundary conditions are implemented, and some are preliminary; however, the infrastructure allows for simple extension and refinement. Five BC types are available: no-slip solid wall (w), periodic (p), in-plane moving wall (m), total pressure inlet (i) and total pressure outlet (o). In settings.in, the number of boundaries must be specified first (as nFaces or nStlFiles for the analytic and STL descriptions, respectively); then, each boundary condition must be prescribed with one line of the type:

type=<bcType> <bcSpec1>=value <bcSpec2>=value ...

where the descriptors <bcType> and <bcSpecs> are defined according to the following table (an illustrative set of BC lines is given after the table):

| Boundary condition | bcType | bcSpecs | Notes |
|---|---|---|---|
| No-slip solid wall | w | None | Managed via the IB infrastructure |
| Periodic | p | None | Must be prescribed in pairs (not checked) |
| Plane wall with prescribed wall-parallel velocity | m | vel1, vel2, strDistr1, a1, k1, omega1, strDistr2, a2, k2, omega2 | None |
| Total pressure inlet | i | bcP | None |
| Total pressure outlet | o | bcP | None |
With STL geometry, a boundary condition applies to one STL file. With an analytically described geometry, the sequence of boundaries of the bounding box starts with the face at minimum x, followed by the face at maximum x, and then continues likewise for the y and z directions.

No-slip solid wall

This is the standard solid wall, where no-penetration and no-slip boundary conditions are applied via the IB coefficients.

Periodic

Periodic BC must be applied on one (or more) pairs of whole faces of the bounding box. The correct pairing is not explicitly checked.

Plane wall with prescribed wall-parallel velocity

A plane wall can be given a non-zero velocity in the wall-parallel directions. The wall velocity can be spatially non-uniform, and time-dependent as well. The two wall-parallel directions in the Cartesian reference frame are referred to by the indices 1 and 2.

The descriptors vel1 and vel2 are constants that can be used, for example, to prescribe a Couette flow. Time and space dependency is limited to sinusoidal variations, with amplitudes a1, a2, temporal frequencies omega1, omega2 and spatial frequencies (wavenumbers) k1, k2. The booleans strDistr1 and strDistr2 prescribe whether the corresponding sinusoid varies along the streamwise or the spanwise direction. Hence, for a plane wall with strDistr2=YES, an expression of the form u2 = vel2 + a2 sin(k2 x1 - omega2 t) enforces, for the spanwise component 2, a travelling wave along the streamwise direction 1.
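As an illustration, a moving wall of this kind might be prescribed with a line such as the following, where all numerical values are purely illustrative:

type=m vel1=0.0 vel2=0.0 strDistr1=NO a1=0.0 k1=0.0 omega1=0.0 strDistr2=YES a2=0.5 k2=10.0 omega2=100.0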

Inlet: total pressure

At the inlet boundaries of the computational domain, the total pressure is prescribed by assigning its value through the descriptor bcP. The pressure, the velocity component normal to the boundary and the two velocity components parallel to the boundary are then prescribed consistently with this condition, using zero-gradient conditions where appropriate.

Outlet: total pressure

At the outlet boundaries of the computational domain, the total pressure is prescribed by assigning its value through the descriptor bcP. Pressure and velocity components are prescribed in a way that depends on whether a backflow is locally present. At points with no backflow, the entire velocity field receives a zero-gradient condition and the pressure is set from the prescribed total pressure; with backflow, a modified pressure condition is applied instead.

Compilation and dependencies

AMPHIBIOUS is built to be as self-contained as possible, yet some dependencies exist. The current stable version has been tested with the following software:

  • the CPL compiler and its (few) dependencies (current version tested with 2025-11-02-cpl.sh)
  • a C compiler (current version tested with gcc v.12.2.0 and v.14.2.0) and the nvc compiler (current version tested with nvc v.23.11 and v.25.1)
  • MPI for parallel CPU computing (current version tested with openmpi v.4.1.6 and v.5.0.7, and hpcx-mpi v.2.19); GPU-aware support is required for multi-GPU computing
  • OpenACC >=2.6 for parallel (multi)GPU computing

Programs and utilities are compiled with the provided Makefile. The default compiler is gcc for the CPU target and nvc for the GPU target. The compiler, as well as other compilation options, can be changed by editing the Makefile.

Setting up a simulation

The program contains compile-time customization options which can be selected by the user. These consist of:

  • several macros, which can be set in the file headers.cpl by uncommenting the related #define lines;

  • a flow-specific file, placed in the modules-solver/ folder, to be used whenever flow-specific operations in solver are required. To do so, the user must USE it in solver. In the two examples, the file specifics-channel.cpl manages the screen output at runtime and the forcing term, whereas specifics-nose.cpl only manages the screen output.

The following macros can be set:

| Name | Used by | Description |
|---|---|---|
| GRID_IS_NONUNIFORM | preproc | Enables a non-uniform Cartesian grid in the three directions (default is uniform) |
| GEOMETRY_IS_ANALYTIC | preproc | Allows preprocessing of analytically-described geometries. Mutually exclusive with GEOMETRY_IS_STL |
| GEOMETRY_IS_STL | preproc, solver | Allows preprocessing of STL-described geometries. Mutually exclusive with GEOMETRY_IS_ANALYTIC |
| TARGET_IS_GPU | solver | Sets the GPU compile target; requires a C compiler with support for OpenACC directives |
| STATSIZE_IS_DOUBLE | solver, postproc | Computes statistics in double precision (default is single precision) |

Flow-specific files can be used by solver to customize its behavior. In the tutorials, the file specifics-channel.cpl manages the runtime screen output and the forcing term, and includes initial conditions for laminar Poiseuille and Couette flows, as well as a laminar Stokes-layer profile on one wall, with optional deterministic and/or random perturbations. In specifics-nose.cpl, instead, only the screen output is specified. When needed, flow-specific input can be assigned in a dedicated section of the settings.in file. Any ad hoc routine defined in the flow-specific files is INLINEd: in this way, solver variables can be accessed even when they are still undefined at the point where the routine is written (of course, they must be defined when the function is called). The CPL deferred-assignment operator == must be used for this purpose. An example can be found in specifics-channel.cpl.

Running a simulation

A typical workflow involves the sequential run of preproc, solver and possibly postproc, once the simulation is designed by preparing the settings.in file (and by providing STL files, when required).

We describe here how to run the two tutorials currently included in the AMPHIBIOUS repository, which cover both geometry types. For each tutorial, a file named reference-output.dat is available: running the cases with settings.in unchanged is expected to replicate this output. The channel flow example can be run with either analytical or STL geometry input.

During execution, data are read from/written to files residing in a single user-defined directory, the case directory caseDir. Each program is executed as:

mpirun -n <N> <program> <caseDir>

where <program> is ./preproc, ./solver or ./postproc, <N> is the number of MPI ranks, and <caseDir> is the directory where the settings.in file is located. All files and folders are relative to <caseDir>: preproc writes its output files in the sub-folder <caseDir>/cubes/, which is created anew when missing. When the geometry is described via STL files, preproc expects to find them in <caseDir>/geometry/. A more in-depth discussion of how files are saved in the working directory can be found in the dedicated Section.

Channel flow example

This example shows how to carry out the DNS of an indefinite turbulent plane channel flow, with periodic boundary conditions along the two wall-parallel directions. The setup uses an analytical description of the geometry and a 100^3 grid with nDecompCubes=2, and starts from a fictitious initial condition in which a sinusoidal perturbation is superimposed on a power-law velocity profile. Note that such an initial condition is not adequate to initiate a proper channel flow DNS; moreover, the setup corresponds to a very large Reynolds number, and the simulation is highly under-resolved.

Preprocessor

Compile the preprocessor with

make preproc

and run it with:

mpirun -n <N> ./preproc examples/channel

where the number of MPI ranks <N> can be 1 or 2. The execution time is negligible.

Solver

Compile the solver with

make solver

and run as:

mpirun -n <N> ./solver examples/channel

where the number of MPI ranks <N> can be 1 or 2; its maximum value is nTotCubes, which can be read from preproc-log.dat. In settings.in, solver is configured to run a simple CPG simulation (see below) with a prescribed time step, and to stop after 100 steps. Its (serial or parallel) output, recorded in examples/channel/runtimedata.dat, must be identical to examples/channel/reference-output.dat.

Postproc

Compile the postprocessor with

make postproc

and run as:

mpirun -n <N> ./postproc examples/channel

where <N> can be 1 or 2, as for solver. The execution time is negligible. Depending on the configuration in settings.in, postproc outputs binary files for statistics and/or flow fields, and converts them into .vtk legacy files for visualization. In this example, only two snapshots are saved, and the corresponding three-dimensional field for the mean longitudinal velocity um is produced. Moreover, a spatial average of um over the two homogeneous directions is carried out, and a smaller ASCII file with lower-dimensionality statistics is written. It can be conveniently plotted with the gnuplot utility, by prepending the command set key autotitle columnhead so that the first line is read as column headers rather than plotted as data. Details on the output capabilities of postproc can be found in the dedicated Section.
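For instance, a quick look at the averaged profiles could be obtained with the gnuplot commands below, where the file name and column indices are only placeholders for the actual ASCII statistics file written by postproc:

set key autotitle columnhead
plot 'examples/channel/<statsFile>.dat' using 1:2 with lines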

Channel with STL

The same channel example can be run with the geometry described by STL files: 6 STL files are needed for the 6 planes delimiting the bounding box. Each can be described by two triangles only, and can therefore be written by hand. As a convenient alternative, the utility createChannelSTL.cpl in the folder utilities can be used to set the lengths of the box sides and to write the required 6 STL files in the case folder. Compile the utility with:

make createChannelSTL

and execute it as:

./utilities/createChannelSTL examples/channel

to write the required STL files in examples/channel/geometry/.

With the STL files in place, the procedure above can be repeated. First, in headers.cpl, #define GEOMETRY_IS_STL must be selected instead of #define GEOMETRY_IS_ANALYTIC. Then everything must be recompiled and rerun, starting from preproc. Note that slight differences between the screen output and reference-output.dat are expected: with STL, a ray-casting algorithm is used to discriminate interior and exterior points, leading to minor (round-off) differences in the grid along the wall-normal directions.
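In practice, after editing headers.cpl, the full sequence might look as follows (here 2 MPI ranks are used, the maximum for this case):

make preproc
mpirun -n 2 ./preproc examples/channel
make solver
mpirun -n 2 ./solver examples/channel
make postproc
mpirun -n 2 ./postproc examples/channel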

A note on non-dimensionalization

The code uses variables with physical dimensions, as dimensionless parameters of general use cannot be defined. However, this may be inconvenient when dealing with simple flows like the channel or pipe flow, where a dimensionless description is easier and more compact, and reasoning in terms of a Reynolds number is more natural.

One can run the channel flow simulation via the settings.in file in (at least) two ways: constant pressure gradient (CPG) and constant flow rate (CFR). The latter case is an extension of the former. They are described in detail in this paper.

  • CPG. The natural choice of Reynolds number is the friction Reynolds number Re_tau = u_tau h / nu, built with the friction velocity u_tau, the half-gap h and the fluid viscosity nu. One starts by selecting the numerical value of Re_tau; deciding the working fluid via the value of nu sets the value of the product u_tau h, so that picking a value for h sets the value of u_tau. In settings.in, the input variable Lz must be set to the value 2h. The value of u_tau, instead, leads to the value of the input variable headx, given by headx = u_tau^2 / h (a worked example follows the list).

  • CFR (currently not implemented). The natural choice of Reynolds number is the bulk Reynolds number Re_b = U_b h / nu, built with the bulk velocity U_b, the half-gap h and the fluid viscosity nu. One starts by selecting the numerical value of Re_b; deciding the working fluid via the value of nu sets the value of the product U_b h, so that picking a value for h sets the value of U_b. In settings.in, the input variable Lz must be set to the value 2h. The value of U_b, instead, leads to the value of the input variable meanu, given by meanu = U_b.
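As a purely illustrative worked example of the CPG relations above: choosing Re_tau = 180 with a water-like viscosity nu = 1.0e-6 m^2/s and a half-gap h = 0.01 m gives u_tau = Re_tau nu / h = 0.018 m/s; the corresponding inputs would then be Lz = 2h = 0.02 m and headx = u_tau^2 / h = 0.0324 m/s^2 (per unit density).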

Nose flow example

This example shows how to carry out the DNS of the flow inside a human nose, for a steady inspiration driven by the pressure difference between the external ambient and the outflow boundary at the throat. The STL of an anatomy derived from a CT scan is provided (at a very coarse resolution). The setup uses about 5.93 million grid points, of which 1.78 million are internal points. The simulation is reasonable but quite under-resolved, with an approximately isotropic spatial resolution of 0.5 millimeters. Also, the STL anatomy is coarse. The initial condition is a fluid at rest.

Before compilation of preproc, this case requires the following changes to the source files:

  • header.cpl: make sure that #define GEOMETRY_IS_STL is set;
  • header.cpl: make sure that #define GRID_IS_NONUNIFORM is not set;
  • solver.cpl: at lines 37/38, switch to USE modules-common/nose-specifics.

Now, follow the procedure above to run in sequence preproc, solver and postproc with case folder examples/nose, like:

mpirun -n <N> ./preproc examples/nose

In this example, five snapshots of the evolving flow field are saved and then converted into VTK files; statistics for the mean velocity and pressure are also produced.

The case directory and its structure

preproc, solver and postproc all read files from and write files to the case folder, where they are organized in subfolders. Subfolders have fixed names. File names, instead, are composed of:

  • a fixed part, which informs of the content of the file (e.g. restart);
  • a variable part, which is unique for each cube, based on cube ID and the total number of cubes nDecompCubes;
  • the extension of the file (e.g. .vtk or .bin).

For example, when the program is executed serially on a case made of a single cube, the restart file is called restart.1of1.bin, as only one cube is generated.

After the execution of postproc, the case directory is fully populated and presents the following structure:

Figure: complete structure of a case folder (the file reference-output.dat is provided for the tutorials).

The cube-specific files are made of two parts:

  • an ASCII header, whose length of 8 kB is set in header.cpl, containing cube data and metadata, e.g. the cube connectivity in the cube-data files, or the snapshot timestamp in the restart files;
  • a binary chunk, which contains the saved data (e.g. IB coefficients, accumulated statistics, restart field, snapshots).

Binary data are written to file in the same way they are organized in memory at run time, i.e. with SoA (Structure of Arrays) ordering in row-major order (as in the C language: arrays are indexed with the most rapidly changing index last). The data structures are written to disk without discriminating between internal and external points. Not all data structures share the same size: the IB coefficients, restart fields and snapshot data structures are defined also at the halo points of the cube, while the statistics are not.

Performance

Performance analysis is still preliminary, and certain parts of the code do require substantial optimization. Nevertheless, the starting point in terms of raw computational performance is encouraging. The figure below shows the measured wall-clock time, normalized per point and per time (sub)step, for CPU and GPU execution on a problem of size 1024^3 (strong scaling). The test hardware is the Leonardo Booster partition at CINECA. Each of the 3456 nodes has a single-socket Intel Xeon Platinum 8358 with 32 cores, and four NVIDIA A100 GPUs with 64 GB of RAM each. The node has 512 GB of DDR4 RAM; inter-node communications are serviced by NVIDIA Mellanox InfiniBand at 200 Gbps with a DragonFly+ topology. The GPUs in a node are interconnected through NVLink 3.0 for peer-to-peer memory transfers.

Figure: strong scaling on CPUs and GPUs (preliminary data).

The CPU execution has a scalar speed of 55 x 10^-9 seconds per point and per (sub)step on one core; efficiency is gradually affected until the cores of the first node are fully used. After that, linear scaling is observed up to the maximum tested number of 1024 cores. The GPU execution has a single-GPU speed of 7.5 x 10^-10 seconds per point and per (sub)step: a case of size 1024^3 advances by one time step in 0.75 seconds. The multi-GPU scaling is approximately linear up to 32 GPUs, and then starts to degrade gradually. Note how the memory efficiency allows a relatively large case with over a billion points to run on a single core: AMPHIBIOUS needs about 120 bytes per (internal) point, plus 4/8 additional bytes for each quantity for which statistics are accumulated.
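As a quick check on these figures: a 1024^3 case has about 1.07 x 10^9 points, which at roughly 120 bytes per point corresponds to about 130 GB of memory, well within the 512 GB of a single node; on one CPU core, at 55 x 10^-9 seconds per point and per (sub)step, the same case takes roughly one minute per (sub)step.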