ROSE Compiler Framework/OpenMP Acclerator Model Implementation

Overview

We are experimenting a trial implementation of the OpenMP accelerator extension in OpenMP 4.0 specification. It is not meant to be the official or complete implementation due to the fast changing nature of the specification and our resource limitations.

The implementation is based on the existing OpenMP implementation in ROSE.

Implemented Features

The implementation loosely follows the technical report http://www.openmp.org/mp-documents/TR1_167.pdf and later http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf .

pragma omp target + pragma omp parallel for
map clauses
reduction on GPUs
pragma omp target data
collapse clause

Compile and Run

We incrementally release the work as part of the ROSE (EDG 4.x based version) compiler's hosted at https://github.com/rose-compiler/rose-develop

clone it
configure and build librose.so under buildtree/src
build the translator roseompacc under tests/nonsmoke/functional/roseTests/ompLoweringTests/ by typing "make roseompacc"

The translator source is rose-omp-accelerator/tests/nonsmoke/functional/roseTests/ompLoweringTests/roseompacc.C

The command lines (two steps for now) to use roseompacc is exemplified in rose-omp-accelerator/tests/nonsmoke/functional/roseTests/ompLoweringTests/Makefile.am

# Experimental translation for OpenMP accelerator model directives
# no final compilation for now, which requires CUDA compiler for the generated code
test_omp_acc:axpy_ompacc.o matrixmultiply-ompacc.o jacobi-ompacc.o
rose_axpy_ompacc.cu:roseompacc
        ./roseompacc$(EXEEXT) ${TEST_FLAGS} -rose:openmp:lowering -rose:skipfinalCompileStep -c $(TEST_DIR)/axpy_ompacc.c 
rose_matrixmultiply-ompacc.cu:roseompacc
        ./roseompacc$(EXEEXT) ${TEST_FLAGS} -rose:openmp:lowering -rose:skipfinalCompileStep -c $(TEST_DIR)/matrixmultiply-ompacc.c 
rose_jacobi-ompacc.cu:roseompacc
        ./roseompacc$(EXEEXT) ${TEST_FLAGS} -rose:openmp:lowering -rose:skipfinalCompileStep -c $(TEST_DIR)/jacobi-ompacc.c 

# build executables using nvcc
axpy_ompacc.out:rose_axpy_ompacc.cu
        nvcc $<  $(TEST_INCLUDES) $(top_srcdir)/src/midend/programTransformation/ompLowering/xomp_cuda_lib.cu -o $@
matrixmultiply-ompacc.out:rose_matrixmultiply-ompacc.cu
        nvcc $< $(TEST_INCLUDES) $(top_srcdir)/src/midend/programTransformation/ompLowering/xomp_cuda_lib.cu -o $@
jacobi-ompacc.out:rose_jacobi-ompacc.cu
        nvcc $< $(TEST_INCLUDES) $(top_srcdir)/src/midend/programTransformation/ompLowering/xomp_cuda_lib.cu -o $@

Implementation Details

Parsing

flex/bison

AST attribute storing parsing results

IR generation: ROSETTA, convert attribute to dedicated AST

https://github.com/rose-compiler/rose-develop/blob/master/src/frontend/SageIII/ompAstConstruction.cpp

Translation

translation:

https://github.com/rose-compiler/rose-develop/blob/master/src/midend/programTransformation/ompLowering/omp_lowering.cpp
- void transOmpTargetParallel() translate a parallel region associated with a target directive
- void transOmpTargetLoop() translate an omp for loop associated with a target directive
- void transOmpMapVariables() translate data mapping clauses

Internally, it invokes the outliner acting in mode of classic (passing individual parameters without any wrapping array or structure)

Outliner::enable_classic = true;

Runtime Support

runtime support

Rules about what to put into each file:

libxomp.h runtime interface to compiler writers
xomp.c runtime support for CPU threading
xomp_cuda_lib.cu provides the regular C functions wrapping CUDA code (without the __device__ keyword) called by the host code.
xomp_cuda_lib_inlined.cu provides the __device__ functions called by the CUDA kernel functions (with __global__ keyword).

At the time the work being done, CUDA and/or nvcc does not have linker for device code (linking __global__ and _device_ functions from different files). We have to put some common device functions into this file. So they can be included in the same file in which the generated CUDA kernels are generated (__global__ functions can only call __device__ functions from the same file).

Device Data Management

The runtime provides two ways to manage device data

explicit manage of data allocation, copy-to, copy-back, and deallocation
automatic management using device data environment (DDE) functions
- xomp_deviceDataEnvironmentEnter(); // enter an environment , pushed to an environment stack
- xomp_deviceDataEnvironmentPrepareVariable(); // reuse data if already in parent environments, or allocate, register with the current environment stack
- xomp_deviceDataEnvironmentExit(); // auto deallocation, copy back data. pop environment stack

Examples explicit data management

int mmm()
{
{
    float *_dev_a;
    int _dev_a_size = sizeof(float ) * N * M;
    _dev_a = ((float *)(xomp_deviceMalloc(_dev_a_size)));
    xomp_memcpyHostToDevice(((void *)_dev_a),((const void *)a),_dev_a_size);
    float *_dev_b;
    int _dev_b_size = sizeof(float ) * M * K;
    _dev_b = ((float *)(xomp_deviceMalloc(_dev_b_size)));
    xomp_memcpyHostToDevice(((void *)_dev_b),((const void *)b),_dev_b_size);
    float *_dev_c;
    int _dev_c_size = sizeof(float ) * N * M;
    _dev_c = ((float *)(xomp_deviceMalloc(_dev_c_size)));
    xomp_memcpyHostToDevice(((void *)_dev_c),((const void *)c),_dev_c_size);
/* Launch CUDA kernel ... */
    int _threads_per_block_ = xomp_get_maxThreadsPerBlock();
    int _num_blocks_ = xomp_get_max1DBlock(1023 - 0 + 1);
    OUT__1__9221__<<<_num_blocks_,_threads_per_block_>>>(_dev_a,_dev_b,_dev_c);
    xomp_freeDevice(_dev_a);
    xomp_freeDevice(_dev_b);
    xomp_memcpyDeviceToHost(((void *)c),((const void *)_dev_c),_dev_c_size);
    xomp_freeDevice(_dev_c);
  }
  return 0;
}

Automated data management supporting reusing nested device data environments

int mmm()
{
{
    xomp_deviceDataEnvironmentEnter();
    float *_dev_a;
    int _dev_a_size = sizeof(float ) * 1024 * 1024;
    _dev_a = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)a),_dev_a_size,1,0)));
    float *_dev_b;
    int _dev_b_size = sizeof(float ) * 1024 * 1024;
    _dev_b = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)b),_dev_b_size,1,0)));
    float *_dev_c;
    int _dev_c_size = sizeof(float ) * 1024 * 1024;
    _dev_c = ((float *)(xomp_deviceDataEnvironmentPrepareVariable(((void *)c),_dev_c_size,1,1)));
/* Launch CUDA kernel ... */
    int _threads_per_block_ = xomp_get_maxThreadsPerBlock();
    int _num_blocks_ = xomp_get_max1DBlock(1023 - 0 + 1);
    OUT__1__9221__<<<_num_blocks_,_threads_per_block_>>>(_dev_a,_dev_b,_dev_c);
    xomp_deviceDataEnvironmentExit();
  }
  return 0;
}

Environment variables

Control the number of devices being used in the multi-GPU support by default

 export OMP_NUM_DEVICES=5

The runtime will automatically detect this environment variable and use it to control the GPU count.

You can retrieve this count n the code, like

 int GPU_N = xomp_get_num_devices(); // this function will obtain the env variable.

If OMP_NUM_DEVICES is specified, xomp_get_num_devices() will internally call int omp_get_max_devices(void) to obtain the max number of devices supported by the hardware.

Relevant commit

Loop scheduling

Runtime loop scheduling must be provided since the naive 1-iteration-to-1-thread mapping in CUDA tutorials won't work for large iteration space. (iterations count > total GPU threads).

// Jacobi 2-D kernel: 
// Naive 1-iteration to 1 GPU thread scheduling: each GPU thread gets one iteration to work on.
// Problem: won't scale to large iteration space.
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;

    newa[j*m+i] = w0*a[j*m+i] +
	    w1 * (a[j*m+i-1] + a[(j-1)*m+i] +
		  a[j*m+i+1] + a[(j+1)*m+i]) +
	    w2 * (a[(j-1)*m+i-1] + a[(j+1)*m+i-1] +
		  a[(j-1)*m+i+1] + a[(j+1)*m+i+1]);

We provide two loop schedulers in the runtime so each CUDA thread may get more than 1 iterations to handle large iteration space.

a static even scheduler: this scheduler evenly divides up the iteration space into roughly equal sized chunks. It then assigns each chunk to a CUDA thread. This scheduling policy may stress the memory too much since each thread will touch a large range of data brought in by the iteration chunk
a round-robin scheduler: Each thread grabs one iteration (or more iterations) at a time. It is identical to OpenMP's schedule(static, chunk) policy.

Testing use of the schedulers (using corresponding CPU versions) are given at:

https://github.com/rose-compiler/rose-develop/blob/master/src/midend/programTransformation/ompLowering/xomp_accelerator_sched_test_v2.c

GPU version use example: the round-robin scheduler (now the new default scheduler)

void OUT__2__10550__(int n,int *_dev_u)
{
  int ij;
  int _dev_lower, _dev_upper;


  // variables for adjusted loop info considering both original chunk size and step(strip)
  int _dev_loop_chunk_size;
  int _dev_loop_sched_index;
  int _dev_loop_stride;

  // 1-D thread block:
  int _dev_thread_num = omp_get_num_threads();
  int _dev_thread_id = omp_get_thread_num();
  printf ("thread count = %d, current thread id = %d\n", _dev_thread_num, _dev_thread_id);

  int orig_start =0; // must be correct!!
  int orig_end = n-1; // use inclusive bound
  int orig_step = 1;
  int orig_chunk_size = 1;

  XOMP_static_sched_init (orig_start, orig_end, orig_step, orig_chunk_size, _dev_thread_num, _dev_thread_id, \
      & _dev_loop_chunk_size , & _dev_loop_sched_index, & _dev_loop_stride);

  printf ("Initialized chunk size = %d, sched indx =%d, stride = %d\n",_dev_loop_chunk_size, _dev_loop_sched_index, _dev_loop_stride);

  while (XOMP_static_sched_next (&_dev_loop_sched_index, orig_end, orig_step, _dev_loop_stride, _dev_loop_chunk_size, _dev_thread_num, _dev_thread_id, & _dev_lower
        , & _dev_upper))
  {
    printf ("Thread ID: %d Allocated lower = %d upper = %d\n", _dev_thread_id, _dev_lower, _dev_upper);
    for (ij = _dev_lower ; ij <= _dev_upper; ij ++) { // using inclusive bound here
        _dev_u[ij] += (n - ij);         
    }
  }
}

Example of how to use the static-even scheduler (no longer the default scheduler in recent release due to performance drawbacks.):

// using a scheduler
__global__ void OUT__1__11058__(int j,int k,float *_dev_a,float *_dev_b,float *_dev_c)
{
  int _dev_i;
  long _dev_lower, _dev_upper;
  XOMP_accelerator_loop_default (0, MSIZE -1 , 1, &_dev_lower, &_dev_upper);

  for (_dev_i = _dev_lower; _dev_i<= _dev_upper; _dev_i ++) 
  {
    for (j = 0; j < MSIZE; j++)
    {
      float c= 0.0;
      for (k = 0; k < MSIZE; k++)
        c += _dev_a[_dev_i * MSIZE + k] * _dev_b[k * MSIZE + j];
      _dev_c[_dev_i * MSIZE + j] = c;
    }
  }
}

Testing

Input files under https://github.com/rose-compiler/rose-develop/tree/master/tests/nonsmoke/functional/CompileTests/OpenMP_tests

axpy_ompacc.c
axpy_ompacc2.c
matrixmultiply-ompacc.c
jacobi-ompacc.c
jacobi-ompacc-opt1.c // use "target data" region

Test directory https://github.com/rose-compiler/rose-develop/tree/master/tests/nonsmoke/functional/roseTests/ompLoweringTests

Test translator: roseompacc.C
Test driver: Makefile.am
make ompacc_test within the build tree will trigger the associated testing
- Make sure you have nvidia nvcc compiler installed to generate the executables. Otherwise only cuda file will be generated and the final compilation will fail.

Some sample generated CUDA files can be found at:

https://github.com/chunhualiao/homp-demo

publication

C. Liao, Y. Yan, B. de Supinsky, D. Quinlan, B. Chapman, Early Experiences With The OpenMP Accelerator Model, IWOMP 2013, https://e-reports-ext.llnl.gov/pdf/755563.pdf

ROSE Compiler Framework/OpenMP Acclerator Model Implementation

Contents

Overview

Implemented Features

Compile and Run

Implementation Details

Parsing

Translation

Runtime Support

Device Data Management

Environment variables

Loop scheduling

Testing

publication

Navigation menu

ROSE Compiler Framework/OpenMP Acclerator Model Implementation

Overview

Implemented Features

Compile and Run

Implementation Details

Parsing

Translation

Runtime Support

Device Data Management

Environment variables

Loop scheduling

Testing

publication

Navigation menu

Search