Improving CUDA Performance

One can regard the action of a CUDA kernel on a mesh as the distribution of elementary tasks (one mesh cell = one elementary task) to the CUDA cores of the GPU. The CUDA cores are grouped into streaming multiprocessors (SMP) on board the GPU. For instance, a Kepler K20 GPU has 13 SMPs with 192 cores each (for single precision data), hence 2496 CUDA cores in total. In a similar manner, splitting the whole task into threads that perform elementary tasks on the CUDA cores obeys a two-level hierarchy: the global mesh must be split into logical blocks, and the blocks are then split into threads.

The user has to determine the size of the blocks in X, Y and Z. A given block runs on a single SMP. If you choose blocks that are too small, the SMPs are underused and the performance is degraded. If you choose blocks that are too large, the small amount of memory within the SMPs (48 KB) is saturated and the extra data is stored within the device’s global memory (the “Video RAM”), with a dramatic performance penalty. Other considerations also affect the choice of the CUDA blocks (for instance memory alignment), but in short there is an optimal block size that maximizes performance. This size depends:

  • On the GPU
  • On the kernel itself (it is not the same for all kernels of FARGO3D).
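To make the two-level hierarchy concrete, the following Python sketch counts how many blocks and threads per block a given block size produces on a mesh. It is an illustration only: the function name `launch_geometry` and the mesh dimensions are assumptions and are not part of FARGO3D. The only hardware figure used is the K20 core count quoted above.

```python
import math

def launch_geometry(mesh, block):
    """Return (grid, threads_per_block, total_blocks) for a mesh split into blocks.

    mesh and block are (nx, ny, nz) tuples; both are illustrative inputs.
    """
    # Each direction needs enough blocks to cover the mesh, hence the ceiling.
    grid = tuple(math.ceil(n / b) for n, b in zip(mesh, block))
    threads_per_block = block[0] * block[1] * block[2]
    total_blocks = grid[0] * grid[1] * grid[2]
    return grid, threads_per_block, total_blocks

# Kepler K20 figures quoted in the text: 13 multiprocessors with 192 cores each.
K20_CORES = 13 * 192  # 2496

# Hypothetical 512 x 256 x 1 mesh with a 64 x 8 x 1 block.
grid, tpb, nblocks = launch_geometry((512, 256, 1), (64, 8, 1))
print(grid, tpb, nblocks)  # (8, 32, 1) 512 256
```

Halving the block size in X doubles the number of blocks and halves the threads per block, which is the kind of trade-off the block size controls.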

By default, the block sizes used in a kernel execution are the numbers provided in the .opt file. These are “reasonable” numbers, but they are the same for all kernels (hence they cannot be optimal for every kernel).

A makefile rule combined with Python scripting has been developed to perform a systematic test of the performance of each kernel, individually, as a function of the size of the CUDA blocks.
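The idea behind such a scan can be sketched in Python as a brute-force search: measure the kernel for each candidate block size and keep the fastest. This is a minimal sketch and not the script shipped with the code. The names `candidate_blocks`, `timed`, `best_block` and the `run_kernel` callable are hypothetical, and the bound of 1024 threads per block is the CUDA limit on Kepler-class GPUs, assumed here as the search limit.

```python
import itertools
import time

def candidate_blocks(max_threads=1024, sizes=(1, 2, 4, 8, 16, 32, 64, 128)):
    """Yield (bx, by, bz) candidates whose thread count fits in one block."""
    for bx, by, bz in itertools.product(sizes, repeat=3):
        if bx * by * bz <= max_threads:
            yield (bx, by, bz)

def timed(run_kernel, repeats=3):
    """Wrap a hypothetical run_kernel(block) into a function returning its
    best wall-clock time (in seconds) over `repeats` executions."""
    def measure(block):
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run_kernel(block)
            best = min(best, time.perf_counter() - start)
        return best
    return measure

def best_block(measure, candidates):
    """Return the candidate block size with the smallest measured cost."""
    return min(candidates, key=measure)
```

Separating `measure` from the search makes it possible to try the search logic with a synthetic cost function before running anything on a GPU.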

At compilation time, a file called setup.blocks (where setup is the name of your setup) is looked for in the corresponding setups/setup directory in order to provide c2cuda.py with the best block size for each kernel. You could write this file by hand, but in practice it is generated automatically by the makefile when you execute the rule called “blocks”:

make blocks setup=SETUP

The variable name “setup” must be written in lower case to avoid confusion with the SETUP variable. Example:

make blocks setup=fargo

And you will see lines similar to:

CompPresIso     64      8       1        appended
CompPresAd was skipped.
compute_slopes was skipped.
compute_star was skipped.
compute_emf was skipped.
update_magnetic was skipped.
substep1_x      16      8       1        appended
substep1_y      32      4       1        appended
substep1_z was skipped.
substep2_a      64      8       1        appended
...

A file called fargo.blocks is then created inside setups/fargo and filled with this information, which represents the best block size for each kernel. The skipped functions are those that are not used in this particular setup.

The scan generally takes a few minutes. At the end, you have a .blocks file similar to:

CompPresIso     64      8       1
substep1_x      16      8       1
substep1_y      32      4       1
substep2_a      64      8       1
...
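Because the file is plain text with one kernel per line, it is easy to inspect from a script. The sketch below parses text in the layout shown above into a dictionary. `parse_blocks` is a hypothetical helper, not a function of c2cuda.py, and it assumes whitespace-separated columns: the kernel name followed by the block sizes, taken here to be in the order X, Y, Z.

```python
def parse_blocks(text):
    """Map kernel name -> (bx, by, bz) from the contents of a .blocks file."""
    blocks = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not "name int int int".
        if len(fields) != 4 or not all(f.isdigit() for f in fields[1:]):
            continue
        name, bx, by, bz = fields
        blocks[name] = (int(bx), int(by), int(bz))
    return blocks

# Sample contents copied from the listing above.
sample = """\
CompPresIso     64      8       1
substep1_x      16      8       1
substep1_y      32      4       1
substep2_a      64      8       1
"""
blocks = parse_blocks(sample)
print(blocks["substep1_y"])  # (32, 4, 1)
```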

Now, each time you compile the code, this file is read by the c2cuda.py script. In the best cases, performance increases by 10 to 20%. The gain is largest for large 3D MHD problems.

Note: You can keep the .blocks file for future use if you want to save time. In principle, however, the .blocks file is hardware dependent, so be careful if you share the same file across multiple platforms.

MPICUDA

The considerations about GPU Direct and the improvement of MPI communications between GPUs are discussed in the section CUDA aware MPI implementations.