Optimization Guidelines
This page collects the high-level principles that underlie effective DaCe optimization. The other pages in Optimizing Programs show how to apply specific transformations; this one is about what to aim for.
Minimize data movement
Data movement - between memory hierarchy levels, between devices, and across processes - dominates the cost of most modern workloads. The single most impactful optimization in DaCe is therefore to remove or shorten memory transfers. In practice this means:
Fuse states and maps so that producers and consumers of the same data share a scope. Map fusion (and the broader subgraph fusion family of transformations) eliminates the round-trip through main memory between fused nodes.
Stage data in fast memory with InLocalStorage, AccumulateTransient, and tiling transformations. Once a tile is in shared memory or registers, reuse it for as many computations as possible before evicting.
Avoid redundant transients by running the TransientReuse pass. Transients that are not reused inflate working sets without speeding up the program.
Specialize storage location. Set the storage attribute of arrays to match where they are produced and consumed; the codegen will emit the right allocation and copy logic.
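To make the benefit of fusion concrete, a rough traffic estimate can help. The sketch below is not DaCe code: it is a back-of-envelope model, assuming two elementwise maps over n double-precision elements where the first map writes a transient array that the second map reads. The function names and the cost model are illustrative assumptions, and caches are ignored.

```python
def unfused_bytes(n: int, itemsize: int = 8) -> int:
    """Main-memory traffic for two separate elementwise maps.

    Map 1 reads the input and writes a transient; map 2 reads the
    transient and writes the output: four array-sized transfers.
    """
    return 4 * n * itemsize


def fused_bytes(n: int, itemsize: int = 8) -> int:
    """Traffic after fusing the two maps into one scope.

    The transient shrinks to a per-iteration scalar (a register), so
    only the input read and the output write reach main memory.
    """
    return 2 * n * itemsize


n = 1 << 20  # number of elements (illustrative)
saved = unfused_bytes(n) - fused_bytes(n)
print(f"unfused: {unfused_bytes(n)} B, fused: {fused_bytes(n)} B, saved: {saved} B")
```

Under these assumptions fusion halves the traffic of a memory-bound pair of maps, which is why it is usually worth trying before any schedule tuning.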
Expose parallelism, then specialize the schedule
DaCe’s IR separates what is parallel (maps, library nodes) from how it is executed (schedules, storage, library implementations). The recommended order of operations is:
Expose all the parallelism in the program. Use the Python frontend’s dace.map() syntax or transformations like LoopToMap and MapCollapse to produce maximally-parallel loop nests.
Specialize the schedule to the target platform - use transformations such as GPUTransformSDFG or choose a schedule manually (GPU_Device/GPU_ThreadBlock schedules for CUDA, CPU_Multicore for OpenMP, etc.)
Pick library implementations for any matrix multiplications, FFTs, reductions, or collectives. Library nodes default to a portable "pure" expansion; switching to "MKL", "cuBLAS", "NCCL", or another vendor implementation often delivers an order-of-magnitude speedup. See Working with Fast Linear Algebra (BLAS) Libraries.
Profile, do not guess
DaCe makes performance measurement cheap. Use it before deciding where to optimize:
Profiling and Instrumentation shows how to enable instrumentation on selected SDFG elements and how to interpret the resulting reports.
Using Visual Studio Code for Optimization overlays runtime measurements onto the SDFG view so that hot regions are visually obvious.
Interactive Optimization with the SDFG API describes how to drive the optimization loop from a notebook, applying transformations between profiling runs.
A good rule of thumb: the earliest and largest speedups typically come from repeatedly looking for the “hottest” region and applying one or two transformations to improve its scheduling.
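The rule of thumb above amounts to a simple selection step over measured timings. The sketch below assumes the per-region runtimes have already been collected into a plain dictionary; the region names, the numbers, and the helper `hottest_region` are made up for illustration and do not reflect the format of DaCe's instrumentation reports.

```python
def hottest_region(timings_ms: dict[str, float]) -> tuple[str, float]:
    """Return the region with the largest runtime and its share of the total."""
    total = sum(timings_ms.values())
    name = max(timings_ms, key=timings_ms.get)
    return name, timings_ms[name] / total


# Made-up per-region timings in milliseconds, for illustration only.
timings_ms = {"stencil_map": 42.0, "reduction": 6.0, "boundary_copy": 12.0}
name, share = hottest_region(timings_ms)
print(f"optimize '{name}' first: {share:.0%} of measured runtime")
```

After applying a transformation to the selected region, the measurements should be repeated, since the hottest region often changes once the first bottleneck is removed.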
When in doubt, start with the auto-optimizer
The automatic heuristics are a strong baseline on most workloads. Calling auto_optimize once before doing manual optimization saves time and makes it easier to tell whether a manual change is actually helping. If the baseline is already close to peak (verifiable through profiling and a roofline-style calculation), it is often more productive to switch to algorithmic improvements than to keep tuning the schedule.
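A roofline-style check of this kind can be done in a few lines. The sketch below is a minimal version, assuming the kernel's FLOP count and main-memory traffic are known; the machine peaks, the measured throughput, and the helper `attainable_gflops` are hypothetical values chosen for illustration, not measurements.

```python
def attainable_gflops(flops: float, bytes_moved: float,
                      peak_gflops: float, peak_gbps: float) -> float:
    """Roofline bound: min(compute peak, bandwidth * arithmetic intensity)."""
    intensity = flops / bytes_moved  # FLOP per byte
    return min(peak_gflops, peak_gbps * intensity)


# Hypothetical machine numbers, for illustration only.
peak_gflops = 1000.0  # peak compute, GFLOP/s
peak_gbps = 100.0     # peak memory bandwidth, GB/s

# AXPY on n doubles: 2 FLOP per element; reads x and y, writes y (24 B).
n = 10_000_000
bound = attainable_gflops(2 * n, 24 * n, peak_gflops, peak_gbps)

measured_gflops = 7.5  # hypothetical profiling result
print(f"bound: {bound:.2f} GFLOP/s, achieved: {measured_gflops / bound:.0%}")
```

If the achieved fraction is already high, as in this made-up case, further schedule tuning has little headroom, and reducing the work or the traffic algorithmically is the more promising direction.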