Starting GPU programming with Kokkos
Posted 9th March 2023 by Holger Schmitz
The high performance computing landscape has changed quite a bit over the last few years. For several decades, the dominant architecture for HPC systems was based on classical CPU processors distributed over many nodes, connected via a fast network. A typical simulation code on such a cluster consists of more or less serial code that runs on each core and communicates with the other cores via the message passing library MPI. This has the drawback that serial instances of the simulation code running on the same shared-memory node still have to exchange data through this mechanism, which takes time and reduces efficiency. On many current systems, each node has on the order of tens of compute cores accessing the same shared memory. To make better use of these cores, multithreading approaches such as OpenMP have been adopted increasingly over the past decade or so. Even so, relatively few large-scale simulation codes combine OpenMP and MPI to reach maximum performance.
More recently, a new style of architecture has become a serious player. GPUs, which used to play more of an outsider role in serious high performance computing, have taken centre stage. Within the DoE's Exascale project in the US, new supercomputers are being built that break the exaflop barrier. These machines rely on a combination of traditional CPU clusters and GPU accelerators.
For developers of high performance simulation codes this means that much of their code has to be rewritten to make use of the new mixed architectures. I have been developing the Schnek library for a number of years now and I am using it for many of my simulation codes. To keep the library relevant for the future, I am currently in the process of adding mixed-architecture capabilities to Schnek. Fortunately, there already exists the Kokkos library, which provides an abstraction layer over the different machine architectures. It also provides multi-dimensional arrays that can be stored in host memory or in the accelerator's device memory, together with iteration mechanisms to process these arrays on the CPU or GPU.
In this post, I am describing my first experiences with Kokkos and how it can be used to accelerate calculations. I will be starting from scratch, which means that I will begin by installing Kokkos. The best way to do this nowadays is to use a package manager called Spack. Spack is designed as a package manager for supercomputers but will work on any regular Linux machine. To start, I navigate into the folder where I want to install Spack, let's call it /my/installation/folder. I then simply clone Spack's git repository.
git clone -c feature.manyFiles=true https://github.com/spack/spack.git
I am using the bash shell, so now I can initialise all the environment variables by running
source /my/installation/folder/spack/share/spack/setup-env.sh
This command should be run every time you want to start developing code using the packages supplied by Spack. I recommend not putting this line into your .bashrc because some of the Spack packages might interfere with your regular system operation.
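Since this command needs to be run in every new shell session, a small alias can cut down the typing without sourcing Spack automatically on login. A minimal sketch, assuming the installation path from above (the alias name load-spack is my own invention):
alias load-spack='source /my/installation/folder/spack/share/spack/setup-env.sh'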
The Kokkos package has many configuration options. You can now look at them using the command
spack info kokkos
I have an NVidia Quadro P1000 graphics card in my local development machine. For NVidia accelerators, I will need to install Kokkos with Cuda support. In order to pass the correct Cuda architecture to the build, I first check https://developer.nvidia.com/cuda-gpus for my card's compute capability. For my hardware, I find that it is 6.1. The following command will install Kokkos and all of its dependencies.
spack install kokkos +cuda +wrapper cuda_arch=61
The wrapper option is required because I am compiling everything with the gcc compiler. Note that this can take quite some time because many of the packages may be built from source during the installation. This allows Spack packages to be highly specific to your system. Once the installation has completed, I load the Kokkos package with
spack load kokkos
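To double-check which variants were actually built, Spack can list the installed spec (spack find is a standard Spack command; the -v flag shows the variants):
spack find -v kokkos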
Now, I can start creating my first Kokkos example. I create a file called testkokkos.cpp and add some includes at the top.
#include <Kokkos_Core.hpp>
#include <chrono>
The Kokkos_Core.hpp header is needed to use Kokkos, and I also include chrono from the standard library to allow timing the code. Kokkos introduces execution spaces and memory spaces. Data can live on the host machine or on the accelerator device. Similarly, code can execute on the serial CPU, on the GPU, or on some other specialised processor. I want my code to be compiled for two different settings so that I can compare GPU performance against the CPU. I therefore define two types, Execution and Memory, for the execution and memory spaces. These types depend on an external macro that will be passed in by the build system.
#ifdef TEST_USE_GPU
typedef Kokkos::Cuda Execution;
typedef Kokkos::CudaSpace Memory;
#else
typedef Kokkos::Serial Execution;
typedef Kokkos::HostSpace Memory;
#endif
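As an aside, Kokkos also defines default spaces that are chosen when the library itself is configured. If I only ever wanted that configured default, I could rely on it instead of my macro; a minimal sketch of that variant (not what I use below, since I want to build a CPU and a GPU executable from the same source file):
// Let the Kokkos build configuration pick the spaces
typedef Kokkos::DefaultExecutionSpace Execution;
typedef Kokkos::DefaultExecutionSpace::memory_space Memory;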
Kokkos manages data in View objects, which represent multi-dimensional arrays. View takes several template arguments. The first argument specifies the dimensionality and the datatype stored in the array. Further template arguments can specify the memory space and other compile-time configurations. For example, Kokkos::View<double**, Kokkos::HostSpace> defines a two-dimensional array of double precision values in host memory. To iterate over a view, one defines function objects that are passed to what Kokkos calls a "parallel dispatch" operation. The following code defines two such structs that can be used to instantiate function objects.
#define HALF_SIZE 500
struct Set {
    Kokkos::View<double**, Memory> in;

    KOKKOS_INLINE_FUNCTION void operator()(int i, int j) const {
        in(i, j) = i==HALF_SIZE && j==HALF_SIZE ? 1.0 : 0.0;
    }
};
struct Diffuse {
    Kokkos::View<double**, Memory> in;
    Kokkos::View<double**, Memory> out;

    KOKKOS_INLINE_FUNCTION void operator()(int i, int j) const {
        out(i, j) = in(i, j) + 0.1*(in(i-1, j) + in(i+1, j) + in(i, j-1) + in(i, j+1) - 4.0*in(i, j));
    }
};
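Incidentally, function objects are not the only way to express the body of a parallel dispatch: Kokkos also accepts lambdas through the KOKKOS_LAMBDA macro, which expands to the correct host/device annotations. A minimal sketch of the Set operation in that style, assuming a view in and an iteration policy policy like the ones defined below:
Kokkos::parallel_for("Set", policy, KOKKOS_LAMBDA(int i, int j) {
    in(i, j) = i==HALF_SIZE && j==HALF_SIZE ? 1.0 : 0.0;
});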
The Set struct initialises an array to 0.0 everywhere except for one position, where the value is set to 1.0. This results in a single spike in the centre of the domain. Diffuse applies a diffusion operator to the in array and stores the result in the out array. The calculations can't be carried out in place because the order in which the function objects are called may be arbitrary. This means that, after the diffusion operator has been applied, the values have to be copied back from the out array to the in array.
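For reference, the update that Diffuse implements is the classic five-point stencil for the two-dimensional diffusion equation,

$$u^{\mathrm{new}}_{i,j} = u_{i,j} + 0.1\left(u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} - 4u_{i,j}\right),$$

where the constant 0.1 plays the role of $D\,\Delta t/\Delta x^2$ for a diffusion coefficient $D$, time step $\Delta t$, and grid spacing $\Delta x$.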
Now that these function objects are defined, I can start writing the actual calculation.
void performCalculation() {
    const int N = 2*HALF_SIZE + 1;
    const int iter = 100;

    Kokkos::View<double**, Memory> in("in", N, N);
    Kokkos::View<double**, Memory> out("out", N, N);

    Set set{in};
    Diffuse diffuse{in, out};
The first two lines in the function define some constants. N is the size of the grids and iter sets the number of times that the diffusion operator will be applied. The Kokkos::View objects in and out store the two-dimensional grids. The first template argument, double**, specifies that the arrays are two-dimensional and store double values. The Memory template argument was defined above and is either Kokkos::CudaSpace or Kokkos::HostSpace. The last two lines in the code segment above initialise my two function objects of type Set and Diffuse.
I want to iterate over the inner domain, excluding the grid points at the edge of the arrays. This is necessary because the diffusion operator accesses the grid cells next to the position that is iterated over. The iteration policy uses the multidimensional range policy from Kokkos.
auto policy = Kokkos::MDRangePolicy<Execution, Kokkos::Rank<2>>({1, 1}, {N-1, N-1});
The Execution template argument was defined above and can either be Kokkos::Cuda or Kokkos::Serial. The main calculation now looks like this.
Kokkos::parallel_for("Set", policy, set);
Kokkos::fence();
auto startTime = std::chrono::high_resolution_clock::now();
for (int i=0; i<iter; ++i)
{
Kokkos::parallel_for("Diffuse", policy, diffuse);
Kokkos::fence();
Kokkos::deep_copy(in, out);
}
auto endTime = std::chrono::high_resolution_clock::now();
auto milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count();
std::cerr << "Wall clock: " << milliseconds << " ms" << std::endl;
}
The function Kokkos::parallel_for applies the function object to each element given by the iteration policy. Depending on the execution space, the calculations are performed on the CPU or the GPU. To set up the calculation, the set operator is applied. Inside the main loop, the diffuse operator is applied, followed by a Kokkos::deep_copy, which copies the out array back to the in array. The calls to Kokkos::fence() ensure that the potentially asynchronous kernel launches have completed before the data is copied and before the timer is read. Notice that I surrounded the loop with calls to the standard library's high_resolution_clock::now(). This will allow me to print out the wall-clock time used by the calculation and give me some indication of the performance of the code.
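As a side note, Kokkos ships its own portable timer that could replace the chrono calls. A minimal sketch using Kokkos::Timer, which reports the seconds elapsed since its construction:
Kokkos::Timer timer;
// ... run the iteration loop ...
std::cerr << "Wall clock: " << 1000.0*timer.seconds() << " ms" << std::endl;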
The main function now looks like this.
int main(int argc, char **argv) {
    Kokkos::initialize(argc, argv);
    performCalculation();
    Kokkos::finalize();
    return 0;
}
It is important to initialise Kokkos before any of its routines are called, and to finalise it before the program exits.
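An alternative that avoids having to remember the finalise call is Kokkos::ScopeGuard, which initialises Kokkos on construction and finalises it when the guard goes out of scope, even on early returns. A minimal sketch of the same main written that way:
int main(int argc, char **argv) {
    Kokkos::ScopeGuard guard(argc, argv);  // finalises automatically when main returns
    performCalculation();
    return 0;
}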
To compile the code, I use CMake. Kokkos provides the necessary CMake configuration files, and Spack sets all the paths so that Kokkos is easily found. My CMakeLists.txt file looks like this.
cmake_minimum_required(VERSION 3.10)
project(testkokkos LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)
set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)
find_package(Kokkos REQUIRED PATHS ${KOKKOS_DIR})
# find_package(CUDA REQUIRED)
add_executable(testkokkosCPU testkokkos.cpp)
target_compile_definitions(testkokkosCPU PRIVATE KOKKOS_DEPENDENCE)
target_link_libraries(testkokkosCPU PRIVATE Kokkos::kokkoscore)
add_executable(testkokkosGPU testkokkos.cpp)
target_compile_definitions(testkokkosGPU PRIVATE TEST_USE_GPU KOKKOS_DEPENDENCE)
target_link_libraries(testkokkosGPU PRIVATE Kokkos::kokkoscore)
Running cmake . followed by make will build two targets, testkokkosCPU and testkokkosGPU. In the compilation of testkokkosGPU, the TEST_USE_GPU macro is defined so that the build uses the Cuda memory and execution spaces.
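On a side note, I usually prefer an out-of-source build that keeps the generated files in a separate directory. A minimal sketch of the commands, run from the folder containing CMakeLists.txt:
mkdir -p build && cd build
cmake ..
make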
Now it's time to compare. Running testkokkosCPU writes out

Wall clock: 9055 ms

Running it on the GPU with testkokkosGPU gives me

Wall clock: 57 ms
That’s right! On my NVidia Quadro P1000, the GPU accelerated code outperforms the serial version by a factor of more than 150.
Using Kokkos is a relatively easy and very flexible way to make use of GPUs and other accelerator architectures. To a developer of simulation codes, the use of function objects may seem a bit strange at first. From a computer science perspective, on the other hand, these operators feel like a very natural way of approaching the problem.
My next task now is to integrate the Kokkos approach with Schnek’s multidimensional grids and fields. Fortunately, Schnek provides template arguments that allow fine control over the memory model and this should allow replacing Schnek’s default memory model with Kokkos views.