Auto-Optimization Pass (Experimental)
Like the Simplify pipeline, auto-optimization is a set of passes applied to an SDFG in a specific order. The goal of this pipeline is to provide automatic optimization heuristics: finding parallelism in programs, mitigating common performance pitfalls (such as excessive allocation/deallocation), choosing fast library implementations (e.g., BLAS-optimized routines for linear algebra operations), and more.
Auto-optimization depends on the graph and the target platform. As it is still an experimental feature, it is not applied automatically to every graph and may break the correctness of certain programs, so use it with caution.
How to trigger auto-optimization
There are several ways to invoke auto-optimization. First, you can configure it to run on any program via the optimizer.autooptimize configuration entry (or by setting the DACE_optimizer_autooptimize environment variable to true).
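For instance, a minimal sketch of setting this entry programmatically through the dace.Config API (assuming it is set before any program is compiled):

    import dace

    # Enable auto-optimization for every program compiled in this process;
    # equivalent to running with DACE_optimizer_autooptimize=true in the shell.
    dace.Config.set('optimizer', 'autooptimize', value=True)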
Another way to do so is via the @dace.program decorator, as shown below:
    import dace

    @dace.program(auto_optimize=True, device=dace.DeviceType.GPU)
    def hello_world(A, B):
        return A @ B
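As an illustration, the first call with concrete arguments compiles the program and runs the auto-optimization pass as part of compilation (a sketch; the GPU device assumes a CUDA-capable setup):

    import numpy as np

    A = np.random.rand(64, 64)
    B = np.random.rand(64, 64)
    C = hello_world(A, B)  # compiled and auto-optimized on the first call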
Lastly, it can be triggered on arbitrary SDFGs by calling the auto-optimization pass:
    import dace
    from dace.transformation.auto import auto_optimize as aopt

    opt_sdfg = aopt.auto_optimize(sdfg, dace.DeviceType.CPU)
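As an example of obtaining such an SDFG, a @dace.program can be converted with to_sdfg() and then optimized and compiled (a sketch reusing the hello_world program from above; sample arguments are provided because the program's parameters are untyped):

    import dace
    import numpy as np
    from dace.transformation.auto import auto_optimize as aopt

    # Convert the program into an SDFG, using sample arguments to infer types
    A = np.random.rand(64, 64)
    B = np.random.rand(64, 64)
    sdfg = hello_world.to_sdfg(A, B)

    opt_sdfg = aopt.auto_optimize(sdfg, dace.DeviceType.CPU)
    csdfg = opt_sdfg.compile()  # compile the optimized SDFG for execution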
What does the pass contain?
The auto-optimization pass (auto_optimize()) includes the following transformations, applied in this order:
Loop-to-map conversion (auto-parallelization): converts loops whose iterations are provably independent into parallel maps, followed by MapCollapse to parallelize across multiple dimensions (a manual sketch appears after this list).
Greedy subgraph fusion (fusing the contents of maps with common dimensions to reduce data movement). See SubgraphFusion for more information.
Move loops into maps (when the memory access pattern permits) in order to increase the granularity of the work threads perform (MoveLoopIntoMap).
(For FPGAs) Interleave data containers (e.g., arrays) across off-chip memory banks, and use local memory (e.g., BRAM) when possible.
Tiling of maps with write-conflict resolution to reduce atomic operations (tile sizes are configurable via optimizer.autotile_size). Partial parallelism (non-conflicting dimensions) can also be extracted to convert atomics to simple updates (configurable via optimizer.autotile_partial_parallelism, True by default).
Set all library nodes to expand to fast implementations: first using the fast expansion, if one exists, and then via heuristics that choose the fastest library for the target device (e.g., MKL on CPU if available, cuBLAS on GPU).
Disable OpenMP sections (usually increases performance at the expense of reducing parallelism within a state).
Specialize symbolic values that are known ahead of time into constants.
Move small arrays from heap to stack (the size threshold is also configurable).
Make transient data containers’ allocation lifetime dace.dtypes.AllocationLifetime.Persistent, if possible. This moves allocation and deallocation out of the critical code path and into the SDFG init/exit functions (see the sketch at the end of this section).
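As referenced in the first item of this list, a minimal sketch of applying the loop-parallelization step by hand, using the LoopToMap and MapCollapse transformations on an existing sdfg:

    from dace.transformation.interstate import LoopToMap
    from dace.transformation.dataflow import MapCollapse

    # Repeatedly convert provably-parallel loops into maps and collapse
    # nested maps into multidimensional ones, until no match remains.
    sdfg.apply_transformations_repeated([LoopToMap, MapCollapse])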
Apart from the transformations listed above, the pass also transforms the SDFG to run on the specified platform (e.g., GPU, FPGA).
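Lastly, the sketch referenced in the final list item: a simplified, manual version of the persistent-allocation step (assuming an existing sdfg; the actual pass performs additional legality checks):

    import dace

    # Give transient arrays with statically-known sizes a persistent lifetime,
    # so they are allocated once in the SDFG init function and freed in the
    # exit function, rather than on every invocation.
    for name, desc in sdfg.arrays.items():
        if desc.transient and not desc.free_symbols:
            desc.lifetime = dace.dtypes.AllocationLifetime.Persistent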