.. _opt_auto: Auto-Optimization Pass (Experimental) ===================================== Like :ref:`simplify`, auto-optimization is a set of passes applied on an SDFG in a specific order. The goal of this pass is to provide automatic optimization heuristics. This includes finding parallelism in programs, optimizing common performance pitfalls (such as excessive allocation/deallocation), finding fast library implementations (e.g., BLAS-optimized routines for linear algebra operations), and more. .. warning:: Auto-optimization depends on the graph and the target platform. As it is still an experimental feature, it is **not** applied automatically on every graph and *may break correctness of certain programs*, so use it with caution. How to trigger auto-optimization -------------------------------- There are several ways to invoke auto-optimization. First, you can configure it to run on any program in ``.dace.conf`` under :envvar:`optimizer.autooptimize` (or setting the ``DACE_optimizer_autooptimize`` environment variable to ``1``). Another way to do so is on the ``@dace`` decorator, as shown below: .. code-block:: python import dace @dace.program(auto_optimize=True, device=dace.DeviceType.GPU) def hello_world(A, B): return A @ B Lastly, it can be triggered on arbitrary SDFGs by calling the auto-optimization pass: .. code-block:: python import dace from dace.transformation.auto import auto_optimize as aopt opt_sdfg = aopt.auto_optimize(sdfg, dace.DeviceType.CPU) What does the pass contain? --------------------------- The auto-optimization pass (:func:`~dace.transformation.auto.auto_optimize.auto_optimize`) includes the following transformations, applied in this order: * Loop-to-map conversion (auto-parallelization): :class:`~dace.transformation.interstate.loop_to_map.LoopToMap` * :ref:`simplify` * Multi-dimensional :class:`~dace.transformation.dataflow.map_collapse.MapCollapse` to parallelize across multiple dimensions. * Greedy subgraph fusion (fusing contents of maps with common dimensions to reduce data movement). See :class:`~dace.transformation.subgraph.subgraph_fusion.SubgraphFusion` for more information. * Move loops into maps (when memory access pattern permits) in order to increase the granularity of work threads perform (:class:`~dace.transformation.interstate.move_loop_into_map.MoveLoopIntoMap`). * Tiling of maps with write-conflict resolution to reduce atomic operations (tile sizes are configurable via :envvar:`optimizer.autotile_size`). Partial parallelism (non-conflicting dimensions) can also be extracted to convert atomics to simple updates (configurable in :envvar:`optimizer.autotile_partial_parallelism`, True by default). * Set all library nodes to expand to fast implementations: first using the ``fast`` expansion if exists, and then via heuristics for choosing the fastest library for the target device (e.g., MKL on CPU if available, CUBLAS on GPU). * Disable OpenMP sections (usually increases performance at the expense of reducing parallelism within a state). * Specialize known symbolic values to the known constants. * Move small arrays from heap to stack (threshold is also configurable in :envvar:`optimizer.autotile_size`). * Make transient data containers' allocation lifetime :class:`dace.dtypes.AllocationLifetime.Persistent`, if possible. This moves allocation and deallocation out of the critical code path and into the SDFG init/exit functions. Apart from those, the pass transforms the SDFG to run on the specified platform (e.g. GPU).