Optimization Guidelines

This page collects the high-level principles that underlie effective DaCe optimization. The other pages in Optimizing Programs show how to apply specific transformations; this one is about what to aim for.

Minimize data movement

Data movement - between memory hierarchy levels, between devices, and across processes - dominates the cost of most modern workloads. The single most impactful optimization in DaCe is therefore to remove or shorten memory transfers. In practice this means:

  • Fuse states and maps so that producers and consumers of the same data share a scope. Map fusion (and the broader subgraph fusion family of transformations) eliminates the round-trip through main memory between fused nodes.

  • Stage data in fast memory with InLocalStorage, AccumulateTransient, and tiling transformations. Once a tile is in shared memory or registers, reuse it for as many computations as possible before evicting it.

  • Avoid redundant transients by running the TransientReuse pass. Transients that are not reused inflate working sets without speeding up the program.

  • Specialize storage location. Set the storage attribute of arrays to match where they are produced and consumed; the codegen will emit the right allocation and copy logic.

Expose parallelism, then specialize the schedule

DaCe’s IR separates what is parallel (maps, library nodes) from how it is executed (schedules, storage, library implementations). The recommended order of operations is:

  1. Expose all the parallelism in the program. Use the Python frontend’s dace.map[...] syntax (for example, for i in dace.map[0:N]) or transformations like LoopToMap and MapCollapse to produce maximally parallel loop nests.

  2. Specialize the schedule to the target platform - use transformations such as GPUTransformSDFG or choose a schedule manually (GPU_Device / GPU_ThreadBlock schedules for CUDA, CPU_Multicore for OpenMP, etc.).

  3. Pick library implementations for any matrix multiplications, FFTs, reductions, or collectives. Library nodes default to a portable "pure" expansion; switching to "MKL", "cuBLAS", "NCCL", or another vendor implementation often delivers an order-of-magnitude speedup. See Working with Fast Linear Algebra (BLAS) Libraries.

Profile, do not guess

DaCe makes performance measurement cheap. Use it before deciding where to optimize.

A good rule of thumb: the earliest and largest speedups typically come from repeatedly looking for the “hottest” region and applying one or two transformations to improve its scheduling.

When in doubt, start with the auto-optimizer

The automatic heuristics are a strong baseline on most workloads. Calling auto_optimize once before doing manual optimization saves time and makes it easier to tell whether a manual change is actually helping. If the baseline is already close to peak (verifiable through profiling and a roofline-style calculation), it is often more productive to switch to algorithmic improvements than to keep tuning the schedule.
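A roofline-style check of the kind mentioned above fits in a few lines of plain Python. The machine numbers below (1000 GFLOP/s peak, 100 GB/s memory bandwidth), the measured runtime, and all function names are illustrative assumptions; substitute values measured on the actual system.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Floating-point operations performed per byte of memory traffic."""
    return flops / bytes_moved


def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def fraction_of_roofline(flops: float, bytes_moved: float, seconds: float,
                         peak_gflops: float, bandwidth_gbs: float) -> float:
    """Achieved performance divided by the roofline bound."""
    achieved_gflops = flops / seconds / 1e9
    bound = attainable_gflops(peak_gflops, bandwidth_gbs,
                              arithmetic_intensity(flops, bytes_moved))
    return achieved_gflops / bound


# Example kernel: B[i] = 2.0 * A[i] on float64 data.
# One flop per element; 8 bytes read plus 8 bytes written per element.
n = 100_000_000
flops = 1.0 * n
bytes_moved = 16.0 * n

frac = fraction_of_roofline(flops, bytes_moved,
                            seconds=0.02,         # assumed measured runtime
                            peak_gflops=1000.0,   # assumed machine peak
                            bandwidth_gbs=100.0)  # assumed memory bandwidth
print(f"{frac:.0%} of the roofline bound")
```

With these assumed numbers the kernel is memory-bound and already runs at about four fifths of its bound, which is the situation where further schedule tuning is unlikely to pay off as much as an algorithmic change.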