GPU vs CPU Benchmarking

Some acceleration factors

At the present time FARGO3D has run on a limited number of platforms, so we have only a limited amount of acceleration factors to quote between CPU and GPU.

The FARGO_SPEEDUP macrocommand

We have developed a macrocommand named FARGO_SPEEDUP. You can see its source in the file src/define.h, near the line 475. This macrocommand is meant to give the speed up factor of a given CUDA kernel with respect to its CPU counter. Its use is overly simple. Suppose that we want to know the speed up ratio GPU vs CPU of the function SubStep1_x(), for the setup fargo.

First, we need to identify where this function is invoked. It is called near line 73 in the file src/algogas.c:

#ifdef X
    FARGO_SAFE(SubStep1_x(dt));
#endif

Note that the invocation is wrapped in a FARGO_SAFE macrocommand, the definition of which is... empty (see file src/define.h near line 409). All the sub steps of FARGO3D are wrapped similarly into this macrocommand. In normal use, it does not do anything. However, it may be redefined (see the alternate definitions commented out near line 409 in src/define.h), so as to provide useful debugging diagnostics.

What we need to do here to get an automatic evaluation of the speed up factor is simply to change our wrapper from FARGO_SAFE to FARGO_SPEEDUP. Note that this new macrocommand will manipulate a bit the function name (it will subsequently invoke SubStep1_x_cpu() then SubStep1_x_gpu(). Since the C preprocessor is unable to manipulate strings, we need to help it identify where the sub-string of arguments begins, by inserting a comma:

#ifdef X
    FARGO_SPEEDUP(SubStep1_x,(dt));  // <= Note the comma before the '('
#endif

We now build the code for the target setup, with the PROFILING and GPU options enabled:

make SETUP=mri GPU=1 PROFILING=1

and we run it:

./fargo3d -m in/mri.par

You should see an output such as:

Wall clock time elapsed during MPI Communications : 0.030 s
OUTPUTS 0 at Physical Time t = 0.000000 OK
TotalMass = 0.0271300282


******
Check point created
******



******
Check point restored
*******

GPU/CPU speedup in SubStep1_x: 22.775
CPU time : 91.1 ms
GPU time : 4 ms

We see that the function is timed both in its CPU and GPU version (this test was obtained on an Intel(R) Core(TM) i7 950 at 3.07 GHz, and on a Tesla C2050 card). We also note how execution continues after the evaluation, so that periodically an evaluation of the speed up of our target function is provided. It is interesting to see how the macrocommand is expanded by the preprocessor:

{
  SynchronizeHD ();
  SaveState ();
  InitSpecificTime (&t_speedup_cpu, "");
  for (t_speedup_count=0; t_speedup_count < 200; t_speedup_count++) {
    SubStep1_x_cpu (dt);
  }
  time_speedup_cpu = GiveSpecificTime (t_speedup_cpu);
  SynchronizeHD ();
  RestoreState ();
  InitSpecificTime (&t_speedup_gpu, "");
  for (t_speedup_count=0; t_speedup_count < 2000; t_speedup_count++) {
    SubStep1_x_gpu (dt);
  }
  time_speedup_gpu = GiveSpecificTime (t_speedup_gpu);
  printf ("GPU/CPU speedup in %s: %g\n", "SubStep1_x", time_speedup_cpu/time_speedup_gpu*10.0);
  printf ("CPU time : %g ms\n", 1e3*time_speedup_cpu/200.0);
  printf ("GPU time : %g ms\n", 1e3*time_speedup_gpu/2000.0);
};

(a proper indentation has been added for legibility.)

We see that the function is firstly executed 200 times on the CPU, then 2000 times on the GPU. The respective single times on CPU and GPU are inferred, and thus the speed up ratio.

Note that we have developed another useful macrocommand in the same spirit, called FARGO_DEBUG, which is meant to automatically compare the result of the CPU version of one routine with the result of its GPU counterpart. It is presented in section Using FARGO_DEBUG.