Cuda translator (C2CUDA.py).

One of the most interesting features of FARGO3D is that it can run on GPUs. If you look the src/ directory, there is only very few routines related with CUDA. The philosophy of the development of FARGO3D was to avoid the kernel-writing process. Instead, following a set of very simple rules, we were able to develop a python script that translates the time consuming C routines into CUDA kernels.

This section is a brief description of the rules and the process of building CUDA kernels from C functions.

FARGO3D Mesh Functions

In FARGO3D all the time consuming functions that are called during a hydro or MHD time step involve, without exception, a 3D nested loop over the computational mesh. We call hereafter these functions “mesh functions”. The C to CUDA translator has been developed to convert these “mesh functions” to CUDA kernels.

The component of any mesh function are:

  • A header
  • A function type and name
  • Arguments
  • Global variables
  • Local variables
  • A loop over the mesh

In some cases, the structure is a bit more complicated, including some additional lines at the end of the main loop, but this will be discussed below.

The easiest example of a mesh function is the Pressure computation:

#include "fargo3d.h"

void ComputePressureFieldIso_cpu () {

  real* dens = Density->field_cpu;
  real* cs   = Energy->field_cpu;
  real* pres = Pressure->field_cpu;

  int pitch  = Pitch_cpu;
  int stride = Stride_cpu;
  int size_x = Nx;
  int size_y = Ny+2*NGHY;
  int size_z = Nz+2*NGHZ;

  int i;
  int j;
  int k;

  for (k=0; k<size_z; k++) {
    for (j=0; j<size_y; j++) {
      for (i=0; i<size_x; i++ ) {
        pres[l] = dens[l]*cs[l]*cs[l];
      }
    }
  }
}

As you see, the general structure is very simple. Here is the explanation of the different blocks:

Header:

#include "fargo3d.h"

This block contains all the includes required to compile the code.

A function type:

void ComputePressureFieldIso_cpu () {

All the mesh functions must return a void. This is very important because CUDA kernels cannot but return a void (this is one of CUDA kernels limitations). Until now, the name of the function is not important, but you can see that its suffix is (must be, actually) _cpu.

Arguments

In this simple case, the function has not argument, but in general, you can have a wide range of arguments. The most commons are integers, reals & Field variables.

Global variables

real* dens = Density->field_cpu;
real* cs   = Energy->field_cpu;
real* pres = Pressure->field_cpu;
int pitch  = Pitch_cpu;
int stride = Stride_cpu;
int size_x = Nx;
int size_y = Ny+2*NGHY;
int size_z = Nz+2*NGHZ;

This block is related to all the global variables that are not passed by argument to the function. In practice, here you have all the fields required to achieve the calculation, the size of each loop, and some useful variables related with indices.

Local variables:

int i;
int j;
int k;

This block is devoted to variables that are neither passed as arguments nor global. Indices of the loop are always here, but it is possible to add any variable you want.

Main Loop:

for (k=0; k<size_z; k++) {
  for (j=0; j<size_y; j++) {
    for (i=0; i<size_x; i++ ) {
       pres[l] = dens[l]*cs[l]*cs[l];
    }
  }
}

This block is where all the expensive computation is done, and all the parsing process was developed to pass this job to the GPU. Remember that index l is a function of (i,j,k, pitch & stride), defined in src/define.h. There is no need to define nor calculate it here, it is done automatically at built time.

How the script works

In order to develop a general GPU “function”, there are a few problems that must be solved:

  • Develop a proper CUDA header.
  • Develop the kernel function, that is the core of the calculation.
  • Develop a launcher (or wrapper) function, which is called from the C code and constitutes the interface between the main stream of FARGO3D and the CUDA kernel.
  • Perform communications between host and device (between CPU and GPU).
  • Split the main loop into a lot of threads that are given to the CUDA cores.
  • Develop a method for passing global variables to the kernel.
  • Develop a method for passing complex structures (eg: Field) to a kernel.
  • A method for switching between the C function and the CUDA function, are run time (so that we can, among other things, compare the results of execution on the CPU to those on the GPU, in order to validate the correct GPU built).

You can see the structure of a mesh function is really very simple, and you can see that in all the code, the structure of mesh functions is the essentially the same. This allows us to develop an automatic process to generate CUDA code. A series of special lines were developed for simplifying the parsing process:

Here you have a complete example

//<FLAGS>
//#define __GPU
//#define __NOPROTO
//<\FLAGS>

//<INCLUDES>
#include "fargo3d.h"
//<\INCLUDES>

void ComputePressureFieldIso_cpu () {

//<USER_DEFINED>
  INPUT(Energy);
  INPUT(Density);
  OUTPUT(Pressure);
//<\USER_DEFINED>


//<EXTERNAL>
  real* dens = Density->field_cpu;
  real* cs   = Energy->field_cpu;
  real* pres = Pressure->field_cpu;
  int pitch  = Pitch_cpu;
  int stride = Stride_cpu;
  int size_x = Nx;
  int size_y = Ny+2*NGHY;
  int size_z = Nz+2*NGHZ;
//<\EXTERNAL>

//<INTERNAL>
  int i;
  int j;
  int k;
  int ll;
//<\INTERNAL>

//<MAIN_LOOP>

  i = j = k = 0;

#ifdef Z
  for (k=0; k<size_z; k++) {
#endif
#ifdef Y
    for (j=0; j<size_y; j++) {
#endif
#ifdef X
      for (i=0; i<size_x; i++ ) {
#endif
//<#>
        ll = l;
        pres[ll] = dens[ll]*cs[ll]*cs[ll];
//<\#>
#ifdef X
      }
#endif
#ifdef Y
    }
#endif
#ifdef Z
  }
#endif
//<\MAIN_LOOP>
}

As you see, all the main blocks are identified by some special comments (C comments on one line begin with //), but also there are two additional blocks.

We can make an abstract portrait of a general FARGO3D’s C mesh function:

//<FLAGS>
  Your preprocessor variables here (with a heading ``//``
  sign). __GPU and __NOPROTO must be defined, as in the example.
//<\FLAGS>

//<INCLUDES>
  your includes here ("fargo3d.h" must be in the list)
//<\INCLUDES>

function_name_cpu(arguments) {  // <== the name MUST end in ``_cpu``

  //<USER_DEFINED>
    Some general instructions.
  //<\USER_DEFINED>

  //<EXTERNAL>
    type internal_name = external_variable;
    type* internal_pointer = external_pointer_cpu; // <== the name MUST end in ``_cpu``.
  //<\EXTERNAL>

  //<INTERNAL>
    type internal_name1 = initialization;
    type internal_name2;
  //<\INTERNAL>

  //<CONSTANT>
    //type internal_name1(1);    //Note: these lines begin with a ``//``
    //type array_name2(size);
  //<\CONSTANT>

  //<MAIN_LOOP>
  #ifdef Z
     for (k=0; k<size_z; k++) {
  #endif
  #ifdef Y
       for (j=0; j<size_y; j++) {
  #endif
  #ifdef X
          for (i=0; i<size_x; i++ ) {
  #endif
  <#>
         Anything you want here.
  <\#>
  #ifdef X
          }
  #endif
  #ifdef Y
       }
  #endif
  #ifdef Z
     }
  #endif
  //<\MAIN_LOOP>

  //<LAST_BLOCK>
    Some final instruction(s).
  //<\LAST_BLOCK>

}

Below, there is an explanation of each field, an how to use it. You can browse the source files of mesh functions to see examples. You can have the list of the corresponding files by have a look at src/makefile. There, you see a block called GPU_OBJBLOCKS. All the mesh functions are found in the files that have same prefix as those found in this list, but with a .c suffix instead of _gpu.o.

FLAGS

Identificator

<FLAGS> ... <\FLAGS>

Parsed as:

Textually, with heading comment sign removal.

Location:

Normally, at the top of a  C file.

There are two flags that must be always included:

//#define __GPU
//#define __NOPROTO

They are important for a proper header building.

The __GPU can be used inside a C mesh function to issue specific lines that should be run only on the GPU version, with the help of the macrocommand:

#ifdef __GPU
#endif

INCLUDES

Identificator

<INCLUDES> ... <\INCLUDES>

Parsed as:

Textually

Location:

Normally, after the FLAGS block.

The INCLUDES block is the block where all the headers are. This block must contain at least:

#include "fargo3d.h"

You may include any other header file.

Function name:

Identificator

Implicit, only a string

Parsed as:

function_name_cpu --> function_name_gpu (for the wrapper or launcher)
                  --> function_name_kernel (for the associated CUDA kernel)

Location:

Normally, after the INCLUDES block.

Arguments:

Identificator

Implicit, only a string

Parsed as:

Field* f --> real f->field_gpu
non pointer argument --> textually

Location:

Normally, after the INCLUDES block.

Any built-in type is allowed, including FARGO3D’s real type. The only structure allowed is the “Field” structure.

There are two constrains about the arguments field:

  • All must be on the same line.
  • The Field structures are the last arguments.

USER DEFINED:

Identificator

<USER_DEFINED> ... <\USER_DEFINED>

Parsed as:

Textually

Location:

After the function name.

This block can be very general. You can do here a lot of things, because there is no limitation on syntax, everything inside is parsed textually. In practice, we use this block mainly to do memory transfers between host & device when they are needed, by issuing INPUT and OUTPUT directives.

This block is a kind of pre kernel-execution instructions, will be executed before the kernel launch, by the launcher (or wrapper) function.

In a similar way the post kernel-execution is the block called LAST_BLOCK.

EXTERNAL:

Identificator

<EXTERNAL> ... <\EXTERNAL>

Parsed as:

All the external variables are parsed as arguments of the kernel.
type variable = global_variable_cpu --> function_name_kernel(type global_variable_gpu)
type* variable = global_variable_cpu --> function_name_kernel(type* global_variable_gpu)

Location:

After the USER_DEFINED block.

The cuda kernels cannot see the global host variables. This block is meant to grant access to these variables to the kernels, so that all the variables dealt with in this block are global. The main rule when your draw a list of global variables inside the EXTERNAL block is:

  • Avoid the use of all capital variables. Instead, declare another variable with the same name but without capitals, and declare the variable equal to the global variable. example: (real omegaframe = OMEGAFRAME).

INTERNAL:

Identificator

<INTERNAL> ... <\INTERNAL>

Parsed as:

Textually

Location:

Normally, after the EXTERNAL block.

All the internal variables are work variables, with a very local scope. You find here the indices of the loops, but you could include any other variable that you need. Maybe, one of the most interesting examples on how to use this block is found in compute_force.c.

CONSTANT:

Identificator

<CONSTANT> ... <\CONSTANT>

Parsed as:

The variables inside are moved to the constant memory if BIGMEM
is not defined. If BIGMEM is defined, the variables are moved instead to the
device's global memory.  Example:

<CONSTANT>
// real ymin(Ny+2*NGHY+1);
<\CONSTANT>

Is parsed as:

#define ymin(i) ymin_s[(i)]

...

CONSTANT(real, ymin_s, some_number);

Here, some_number is a fraction of the constant memory size, calculated
by the parser.

...

#ifdef BIGMEM
#define ymin_d &Ymin_d
#endif
CUDAMEMCPY(ymin_s, ymin_d, sizeof(real)*(Ny+2*NGHY+1), 0, cudaMemcpyDeviceToDevice);

...

Location:

Normally, after the INTERNAL block.

This is one of the most complex blocks. It is very complex because the process is complex. We need a way to pass our light arrays to the constant memory of the GPU (for performance reasons). But there are some cases where the problem is too big and the constant memory is not enough. In this case, one can define the build flag BIGMEM. The main portrait of the process is:

Declare as constant some global variables you want to use without passing it as external, or use this block for light arrays, similar to xmin, sxk, etc. This block reserves a constant memory segment then copy the data to the this segment. If BIGMEM is activated, the constant memory is not used but instead, the global memory is used. CUDAMEMCPY macrocommand is expanded in a manner that depends on whether BIGMEM is defined and therefore performs the copy to the correct location (constant or global memory). You can see this in src/define.h. Note that there is a subtlety here: in case you use BIGMEM, the constant memory is still used to store the pointer to the array. In the other case it stores directly the array itself. The variables for this block are created in src/light_global_dev.c.

The scalar variables are passed as:

// type variable(1)

While the array variables are passed as:

// type variable(size_of_the_array)

Note that all this block is commented out in the C file, with // at the beginning of each line.

MAIN LOOP:

Identificator

<MAIN_LOOP> ... <\MAIN_LOOP>

Parsed as:

The size of the loops is read and parsed. Also, the indices are initialized:

Example:

      #ifdef Z
        for (k=NGHZ; k<size_z; k++) {
      #endif
      #ifdef Y
          for (j = NGHY; j<size_y; j++) {
      #endif
      #ifdef X
            for (i = 0; i<size_x; i++) {
      #endif

is parsed to:

      #ifdef X
      i = threadIdx.x + blockIdx.x * blockDim.x;
      #else
      i = 0;
      #endif
      #ifdef Y
      j = threadIdx.y + blockIdx.y * blockDim.y;
      #else
      j = 0;
      #endif
      #ifdef Z
      k = threadIdx.z + blockIdx.z * blockDim.z;
      #else
      k = 0;
      #endif

      #ifdef Z
      if(k>=NGHZ && k<size_z) {
      #endif
      #ifdef Y
      if(j >=NGHY && j <size_y) {
      #endif
      #ifdef X
      if(i <size_x) {
      #endif

The content of the loop is parsed textually, but formally, the
content is part of another block: "<#> <\#>".

Location:

Before the initialization of the indices i,j,k.

This block is very particular because it must to be closed after the close of the outermost loop. Remember that the content of the main loop block, defined by <#>..<\#>, nested within the MAIN_LOOP block, is parsed textually, so you cannot use global variables inside it or the generated CUDA code will not work.

LOOP CONTAIN:

Identificator

<#> ... <\#>

Parsed as:

Textually.

Location:

After the innermost loop or where you want to have a textual
parsing inside the main_loop.

If you put the beginning of this block not exactly after the innermost loop, you can make some interesting things. With this technique, you can skip some lines devoted to the CPU. You can also achieve this with the _GPU flag declared at the beginning, as shown earlier.

LAST_BLOCK

Identificator

<LAST_BLOCK> ... <\LAST_BLOCK>

Parsed as:

Textually.

Location:

The last block in the routine.

This block will be executed after the kernel execution. Useful for reductions, for instance.

Common errors:

This section describes the most common errors at compilation time using the parser:

  • The name of a block has white spaces: all blocks declarations must be closed and after that white spaces are not allowed:

    <BLOCK>__  --> wrong
    <BLOCK>    --> ok
    
  • The pointer type is not a type: For the parser, the type of a variable is a string without spaces. A pointer variables must be declared as the pointer type:

    real *rho  --> wrong
    real* rho  --> ok (the type is real*)
    
  • Some block was not properly closed: if you have not closed a block, undefined behavior may result:

    <BLOCK>
    <BLOCK>      --> wrong
    
    <BLOCK>
    </BLOCK>     --> wrong
    
    <BLOCK>
    <\BLOCK>     --> ok
    
  • The end of the name function is not _cpu: The parser cannot find the function name if it does not have a valid name, neither can it invent a rule to make the wrapper and kernel names.

  • The order of the arguments in the function: Remember, pointers to Field structures are at the end.

  • Only one variable per line may appear in the INTERNAL and EXTERNAL fields.

  • Files which have been edited on a Windows machine, and in which lines end with ‘rn’ instead of ending with ‘n’, will fail to be converted to CUDA. Use a conversion procedure such as tr -d '\d' < original_file.c > correct_line_ending.c prior to the build.

  • Values relative to the mesh (such as zmin or xmed, etc.) should be lower case and should be followed by parentheses, not square brackets, because they are considered as macrocommands. You can use substep1_x.c as a template for that.

This list may be completed as we receive users’ feedback.