GPU support

Overview

GPU support can be essential for certain algorithms and computations provided by the data task components. This section gives a detailed explanation of how we foresee GPUs being integrated into the RTC Tk, and shows how this has been done in the provided example.

The use of GPUs is confined to the data tasks, as these implement complex algorithms that may require the power of GPUs to meet deadlines. As such, it is not desirable that the GPU development tools interfere with the entire ELT development environment. The use of GPUs shall not add requirements or dependencies to the ELT dev env, be it compiler versions or anything else.

CUDA

Installation

If GPU support is required, it is currently assumed that the installation is handled by the user directly. The RTC Tk checks whether the CUDA_PATH environment variable is present; if it is, the RTC Tk loads the required libraries and allows for CUDA compilation. This is not planned as a long-term solution, but for early releases it gives an easy method of checking for the presence of a CUDA installation.

Waf Support

Building

The current implementation of waf and wtools provided by the ELT dev env supports the use of CUDA and will offload any CUDA files (.cu) to Nvidia's NVCC compiler. The .cu extension currently causes problems when specified in wscripts and declared using sources=[]. A workaround is shown below:

from wtools.module import declare_custom

def build(bld):
    bld.auto_cshlib(target='rtctkExampleDataTaskGpuLib',
                    features='cxx',
                    source=bld.path.ant_glob('src/*.cu'),
                    cxxflags=[''],
                    use=['cuda', 'cudart', 'cublas'])

declare_custom(provides="rtctkExampleDataTaskGpuLib")

Supported versions

The RTC Toolkit has only been tested with CUDA version 11.1 and is configured for this specific version. While other versions of CUDA may work, they are not officially supported by this release of the RTC Tk. To change the version used, the top-level wscript must be modified.

As part of the current top-level wscript we explicitly load cuBLAS version 11.1, but this can be modified to be generic or to load a different specific version. This is done by editing the line below:

cnf.check_cfg(package='cublas-11.1', uselib_store='CUBLAS', args='--cflags --libs', mandatory=False)

The wtools modifications to support CUDA have been tested with the RTC Tk on CentOS 8. The use of GPUs and CUDA is not supported on CentOS 7.

Example Data Task GPU

Alongside the CPU-based exampleDataTask, we provide a telemetry-based Data Task example that uses a GPU to perform certain computations.

Source Code Location

The example source code can be found in the following sub-directory of the rtctk project:

_examples/exampleDataTask/gpu

Modules

The provided example is composed into the following waf modules:

  • app - Mirrors the app module from exampleDataTask.

  • gpuLib - Provides GPU specific code wrapped into a dedicated library.

  • scripts - Contains scripts for running the example.

The GPU-specific code has been confined to gpuLib which, in part, needs to be compiled with NVCC. The Computation class in app acts as an interface to this gpuLib.

Classes

The example makes use of the following classes:

BusinessLogic

This class contains glue code that ties together the whole component. It is used to set up and control the computation and to exchange data with various repositories. The class is almost identical to the one used in the non-GPU exampleDataTask; the only difference is that it instantiates a GPU computation.

Note

It is important to note that the ReaderThread, which handles the shared-memory access to the telemetry, runs in a separate thread from the one running the computation. If a direct data copy from the shared memory to the GPU is required, the GPU must be initialised on that thread. For this purpose we provide the function RegisterOnDataCallback to register a callback that initialises the GPU. This should be called during the Initialising state transition (see section ReaderThread Class).

Computation

The computation class owns a GpuLib object to which it delegates computations. Computation results are passed back to the BusinessLogic class that then uploads them to the Runtime Configuration Repository.

GpuLib

This class contains the GPU-specific code. It shows how GPU-specific code can be wrapped in a dedicated library that other code links against. The class provides the following interface:

GpuLib(int input_length, int output_length, int gpu);
~GpuLib();

void SetMatrix(float * mat, bool flip = true);
std::vector<float> GetMatrix();
void ResetAvgSlopes();
std::vector<float> GetAvgSlopes();
std::vector<float> GetResults(bool download = false);
void NewSample(const float * sample, int callback_count);
void Compute();
void initReaderThread();

The majority of these methods are used to upload or download a specific vector or matrix to or from the GPU.

SetMatrix has an optional argument to transpose the matrix, since the RTC Tk only supports row-major matrices while cuBLAS expects column-major storage.

NewSample is the callback used for data processing.

void GpuLib::NewSample(const float *sample, int callback_count)
{
    cudaError_t error;
    // Copy the new slope vector from host shared memory to the device.
    error = cudaMemcpy(m_slopes_vector_d, sample, m_slopes * sizeof(float), cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    PrintCudaError(error);
    // Update the running average on the device, one thread per slope element.
    CumulativeAverage_cuda<<<(m_slopes + 255) / 256, 256>>>(m_avg_slopes_d, m_slopes_vector_d, m_slopes, current_sample);
    current_sample++;
}

The callback copies the data to the GPU via cudaMemcpy and then triggers the CumulativeAverage_cuda kernel, which runs asynchronously. This kernel computes a cumulative average of the incoming slopes, which means the slope averaging done during the exampleDataTask compute step is no longer required.

Note

It is not foreseen that any RTC Tk or other ELT dev env code be used within the GPU-specific class, since we cannot guarantee compiler compatibility due to the NVCC dependency.

Limitations and Known Issues

The current GPU example code does not handle data transfer between the Runtime Configuration Repository (RTR) and the GPU ideally; data is still copied more frequently than necessary. In the future we intend to improve this so that the number of redundant data copies is reduced to an absolute minimum.