ROSE Compiler Framework/OpenMP Accelerator Model Implementation


We are experimenting with a trial implementation of the OpenMP accelerator extension in the OpenMP 4.0 specification. It is not meant to be the official or complete implementation, due to the fast-changing nature of the specification and our resource limitations.

The implementation is based on the existing OpenMP implementation in ROSE.

Implemented Features

The implementation loosely follows the technical report http://www.openmp.org/mp-documents/TR1_167.pdf and later http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf . The following features are implemented (a small input example is sketched after this list):

  • pragma omp target + pragma omp parallel for
  • map clauses
  • reduction on GPUs
  • pragma omp target data
  • collapse clause
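For illustration, a minimal input program exercising the target, map, and parallel for support might look like the sketch below. It is loosely modeled on the axpy test input but is not the exact test source (see the Testing section for the actual files):

// Illustrative input only (not the exact axpy_ompacc.c test source):
// an axpy kernel offloaded with "omp target" + "omp parallel for" and map clauses.
void axpy(float *x, float *y, int n, float a)
{
  int i;
#pragma omp target map(to: x[0:n], a, n) map(tofrom: y[0:n])
#pragma omp parallel for
  for (i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}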

Compile and Run

We incrementally release the work as part of the ROSE compiler (the EDG 4.x based version), hosted at https://github.com/rose-compiler/edg4x-rose :

  • clone it
  • configure and build librose.so under buildtree/src
  • build the translator roseompacc under tests/roseTests/ompLoweringTests/ by typing "make roseompacc"

The translator source is rose-omp-accelerator/tests/roseTests/ompLoweringTests/roseompacc.C

The command lines (two steps for now) to use roseompacc are exemplified in rose-omp-accelerator/tests/roseTests/ompLoweringTests/Makefile.am:

# Experimental translation for OpenMP accelerator model directives
# no final compilation for now, which requires CUDA compiler for the generated code
test_omp_acc:axpy_ompacc.o matrixmultiply-ompacc.o jacobi-ompacc.o
rose_axpy_ompacc.cu:roseompacc
        ./roseompacc$(EXEEXT) ${TEST_FLAGS} -rose:openmp:lowering -rose:skipfinalCompileStep -c $(TEST_DIR)/axpy_ompacc.c 
rose_matrixmultiply-ompacc.cu:roseompacc
        ./roseompacc$(EXEEXT) ${TEST_FLAGS} -rose:openmp:lowering -rose:skipfinalCompileStep -c $(TEST_DIR)/matrixmultiply-ompacc.c 
rose_jacobi-ompacc.cu:roseompacc
        ./roseompacc$(EXEEXT) ${TEST_FLAGS} -rose:openmp:lowering -rose:skipfinalCompileStep -c $(TEST_DIR)/jacobi-ompacc.c 

# build executables using nvcc
axpy_ompacc.out:rose_axpy_ompacc.cu
        nvcc $<  $(TEST_INCLUDES) $(top_srcdir)/src/midend/programTransformation/ompLowering/xomp_cuda_lib.cu -o $@
matrixmultiply-ompacc.out:rose_matrixmultiply-ompacc.cu
        nvcc $< $(TEST_INCLUDES) $(top_srcdir)/src/midend/programTransformation/ompLowering/xomp_cuda_lib.cu -o $@
jacobi-ompacc.out:rose_jacobi-ompacc.cu
        nvcc $< $(TEST_INCLUDES) $(top_srcdir)/src/midend/programTransformation/ompLowering/xomp_cuda_lib.cu -o $@

Implementation Details

Parsing

The accelerator directives are parsed with flex/bison. The corresponding IR is generated using ROSETTA.

Translation

Internally, the translation invokes the ROSE outliner in classic mode, which passes individual parameters to the outlined function without any wrapping array or structure (a sketch of the difference is given below):

  • Outliner::enable_classic = true;
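As a rough illustration of the difference (the function names and signatures below are hypothetical, not actual ROSE output), classic mode emits one parameter per passed variable, while the default outlining style would pass a single wrapper argument and unpack it inside the outlined function:

// Classic mode (Outliner::enable_classic = true): one parameter per variable.
__global__ void OUT__classic__(int n, float a, float *_dev_x, float *_dev_y);

// Default (non-classic) outlining: a single wrapper argument holding pointers
// to the original variables, unpacked inside the outlined function.
// Hypothetical sketch for comparison only.
void OUT__wrapped__(void **__out_argv)
{
  int n    = *(int *)__out_argv[0];
  float a  = *(float *)__out_argv[1];
  float *x =  (float *)__out_argv[2];
  float *y =  (float *)__out_argv[3];
  // ... loop body operating on x, y, n, a ...
}

The flat argument list produced by classic mode maps naturally onto a CUDA kernel launch.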

Runtime Support

The CUDA side of the runtime support is provided by xomp_cuda_lib.cu under src/midend/programTransformation/ompLowering/, which is compiled and linked together with the generated .cu files by nvcc (see the Makefile excerpt above).
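Conceptually, the generated host code together with the runtime performs the data movement implied by the map clauses around the CUDA kernel launch. The following sketch uses raw CUDA API calls and a hypothetical kernel name purely for illustration; the actual generated code calls XOMP wrapper functions from the runtime library instead:

#include <cuda_runtime.h>

// Hypothetical outlined kernel, for illustration only.
__global__ void OUT__axpy__(int n, float a, float *_dev_x, float *_dev_y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    _dev_y[i] = a * _dev_x[i] + _dev_y[i];
}

// Conceptual equivalent of map(to: x[0:n]) map(tofrom: y[0:n]) around a kernel launch.
void launch_axpy(float *x, float *y, int n, float a)
{
  float *_dev_x = 0, *_dev_y = 0;
  cudaMalloc((void **)&_dev_x, n * sizeof(float));
  cudaMalloc((void **)&_dev_y, n * sizeof(float));
  cudaMemcpy(_dev_x, x, n * sizeof(float), cudaMemcpyHostToDevice); // map(to:)
  cudaMemcpy(_dev_y, y, n * sizeof(float), cudaMemcpyHostToDevice); // map(tofrom:), copy in

  int threads = 128;
  int blocks = (n + threads - 1) / threads;
  OUT__axpy__<<<blocks, threads>>>(n, a, _dev_x, _dev_y);

  cudaMemcpy(y, _dev_y, n * sizeof(float), cudaMemcpyDeviceToHost); // map(tofrom:), copy out
  cudaFree(_dev_x);
  cudaFree(_dev_y);
}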

Loop scheduling

Runtime loop scheduling must be provided, since the naive one-iteration-to-one-thread mapping used in CUDA tutorials does not work when the iteration space is larger than the number of available GPU threads (iteration count > total GPU threads).

// Jacobi 2-D kernel: 
// Naive 1-iteration to 1 GPU thread scheduling: each GPU thread gets one iteration to work on.
// Problem: won't scale to large iteration space.
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;

    newa[j*m+i] = w0*a[j*m+i] +
            w1 * (a[j*m+i-1] + a[(j-1)*m+i] +
                  a[j*m+i+1] + a[(j+1)*m+i]) +
            w2 * (a[(j-1)*m+i-1] + a[(j+1)*m+i-1] +
                  a[(j-1)*m+i+1] + a[(j+1)*m+i+1]);


We provide two loop schedulers in the runtime so that each CUDA thread can handle more than one iteration when the iteration space is large; a simplified sketch of both policies follows this list.

  1. A static-even scheduler: this scheduler divides the iteration space into roughly equal-sized chunks and assigns one chunk to each CUDA thread. This policy may stress the memory system, since each thread touches the large range of data brought in by its iteration chunk.
  2. A round-robin scheduler: each thread grabs one iteration (or a chunk of iterations) at a time. It is identical to OpenMP's schedule(static, chunk) policy.
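The difference between the two policies can be sketched as follows (hypothetical device functions for illustration only; the real schedulers are the XOMP_* routines used below). Here tid is the global thread id, P is the total number of GPU threads, and N is the iteration count:

// 1. Static-even: each thread gets one contiguous chunk of roughly N/P iterations.
__device__ void static_even_sketch(int tid, int P, int N, float *data)
{
  int chunk = (N + P - 1) / P;        // ceiling division
  int lower = tid * chunk;
  int upper = min(lower + chunk, N);  // exclusive upper bound
  for (int i = lower; i < upper; i++)
    data[i] += 1.0f;                  // placeholder loop body
}

// 2. Round-robin (chunk size 1): thread tid takes iterations tid, tid+P, tid+2P, ...
// Consecutive threads touch consecutive elements, which coalesces memory accesses.
__device__ void round_robin_sketch(int tid, int P, int N, float *data)
{
  for (int i = tid; i < N; i += P)
    data[i] += 1.0f;                  // placeholder loop body
}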

Tests of the schedulers (using corresponding CPU versions) are given at:

Example GPU use of the round-robin scheduler (now the default scheduler):

void OUT__2__10550__(int n,int *_dev_u)
{
  int ij;
  int _dev_lower, _dev_upper;


  // variables for adjusted loop info considering both original chunk size and step(strip)
  int _dev_loop_chunk_size;
  int _dev_loop_sched_index;
  int _dev_loop_stride;

  // 1-D thread block:
  int _dev_thread_num = omp_get_num_threads();
  int _dev_thread_id = omp_get_thread_num();
  printf ("thread count = %d, current thread id = %d\n", _dev_thread_num, _dev_thread_id);

  int orig_start =0; // must be correct!!
  int orig_end = n-1; // use inclusive bound
  int orig_step = 1;
  int orig_chunk_size = 1;

  XOMP_static_sched_init (orig_start, orig_end, orig_step, orig_chunk_size, _dev_thread_num, _dev_thread_id, \
      & _dev_loop_chunk_size , & _dev_loop_sched_index, & _dev_loop_stride);

  printf ("Initialized chunk size = %d, sched indx =%d, stride = %d\n",_dev_loop_chunk_size, _dev_loop_sched_index, _dev_loop_stride);

  while (XOMP_static_sched_next (&_dev_loop_sched_index, orig_end, orig_step, _dev_loop_stride, _dev_loop_chunk_size, _dev_thread_num, _dev_thread_id, & _dev_lower
        , & _dev_upper))
  {
    printf ("Thread ID: %d Allocated lower = %d upper = %d\n", _dev_thread_id, _dev_lower, _dev_upper);
    for (ij = _dev_lower ; ij <= _dev_upper; ij ++) { // using inclusive bound here
        _dev_u[ij] += (n - ij);         
    }
  }
}

Example of how to use the static-even scheduler (no longer the default scheduler in recent releases due to performance drawbacks):

// using a scheduler
__global__ void OUT__1__11058__(int j,int k,float *_dev_a,float *_dev_b,float *_dev_c)
{
  int _dev_i;
  long _dev_lower, _dev_upper;
  XOMP_accelerator_loop_default (0, MSIZE -1 , 1, &_dev_lower, &_dev_upper);

  for (_dev_i = _dev_lower; _dev_i<= _dev_upper; _dev_i ++) 
  {
    for (j = 0; j < MSIZE; j++)
    {
      float c= 0.0;
      for (k = 0; k < MSIZE; k++)
        c += _dev_a[_dev_i * MSIZE + k] * _dev_b[k * MSIZE + j];
      _dev_c[_dev_i * MSIZE + j] = c;
    }
  }
}

Testing

Input files under https://github.com/rose-compiler/edg4x-rose/tree/master/tests/CompileTests/OpenMP_tests

  • axpy_ompacc.c
  • axpy_ompacc2.c
  • matrixmultiply-ompacc.c
  • jacobi-ompacc.c
  • jacobi-ompacc-opt1.c // use "target data" region

Test directory https://github.com/rose-compiler/edg4x-rose/tree/master/tests/roseTests/ompLoweringTests

  • Test translator: roseompacc.C
  • Test driver: Makefile.am
  • Running "make ompacc_test" within the build tree will trigger the associated testing.