CLOGS User Manual

Table of Contents

1. Introduction
2. Installation
2.1. Requirements
2.2. Compiling CLOGS
2.3. Installing CLOGS
2.4. Autotuning
3. Using CLOGS
3.1. Examples
3.2. Benchmark application
3.3. Reentrance
3.4. Error handling
3.5. Memory management
3.6. Profiling
4. Performance
5. License

Chapter 1. Introduction

CLOGS is a library for higher-level operations on top of the OpenCL C++ API. It is designed to integrate with other OpenCL code, including synchronization using OpenCL events.

Currently only two operations are supported: radix sorting and exclusive scan. Radix sort supports all the unsigned integral types as keys, and all the built-in scalar and vector types suitable for storage in buffers as values. Scan supports all the integral types. It also supports vector types, which allows for limited multi-scan capabilities.
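As an illustration of the multi-scan idea, the following CPU-only sketch runs an exclusive scan (each output is the sum of all inputs strictly before it) over two-component elements. It assumes that vector elements are summed component-wise, as OpenCL vector addition does, so one pass yields two independent scans. The UInt2 struct and the exclusiveScan2 function are invented for this sketch and are not part of CLOGS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for an OpenCL uint2 element (sketch only).
struct UInt2
{
    unsigned x, y;
};

// In-place exclusive scan over two-component elements.
// Each component is accumulated independently of the other.
void exclusiveScan2(std::vector<UInt2> &data)
{
    UInt2 sum = {0, 0};
    for (std::size_t i = 0; i < data.size(); i++)
    {
        const UInt2 cur = data[i];
        data[i] = sum;      // sum of all elements strictly before i
        sum.x += cur.x;
        sum.y += cur.y;
    }
}
```

Packing two counters into one element in this way lets a single scan allocate positions in two output streams at once.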

Chapter 2. Installation

2.1. Requirements

At present CLOGS is only supported with GCC on GNU/Linux and with Visual C++ 2010 on Windows. The code for the library itself is portable, but some aspects of the build system and test suite require porting.

It has been tested on an NVIDIA GeForce GTX 480, an AMD R9 270, and a CPU using the AMD APP SDK. The Intel SDK for OpenCL Applications 2013 is not supported as it has numerous bugs. Other OpenCL implementations are expected to work, but they are untested and unlikely to have optimal performance.

The following dependencies are required to build CLOGS:

  • An OpenCL implementation and headers, including a recent version of CL/cl.hpp. If you have compilation problems, your vendor SDK may be providing an old version. You can download the latest from the Khronos OpenCL registry. Note that you should use the header for the latest version of OpenCL (currently 1.2), even if you are targeting an older OpenCL version.

  • A C++ compiler. GCC 4.8 and Visual C++ 2010 have been tested. Older versions should work fine too, but they are not tested regularly.

  • The Boost headers. No dynamic libraries are needed to use the CLOGS library, but the boost::program_options library is needed for the CLOGS test suite and benchmark application.

  • Doxygen is required to build the reference documentation (optional).

  • xsltproc is required to build the user manual (optional). It is best to also install the DocBook XSL stylesheets so that they will be sourced locally rather than from the internet, which can be extremely slow.

  • Python 2.6 or later.

  • CppUnit is required to build and run the test suite (optional).

2.2. Compiling CLOGS

CLOGS uses the waf build system. The build system is included in the distribution, so you do not need to download it separately.

The first step is to configure the build. This is done by running

$ python waf configure

This will check that the necessary headers can be found. OpenCL headers are not always installed in the normal include paths. You can explicitly specify the location for them by running

$ python waf configure --cl-headers=path

Note that path should be the directory containing the CL directory. If other libraries also need to be added to the include or link paths, you should use compiler-specific environment variables, such as INCLUDE and LIBPATH for MSVC or CPATH and LIBRARY_PATH for GCC.

You can control where CLOGS will be installed by using --prefix, as well as the other standard autoconf-style options (run python waf --help for a list).

The configuration will automatically detect whether doxygen and xsltproc are present. However, you can disable them with --without-doxygen and --without-xsltproc.

If you need to debug into CLOGS, you can pass --variant=debug at configuration time to create a debug build, or --variant=optimized to create an optimized build with debug symbols.

2.3. Installing CLOGS

Installation is done by running python waf install. Unless you have changed the installation directory, you will probably need to be root to do this, and you will also need to run ldconfig afterwards on a GNU system.

waf also supports a --destdir option that allows the entire directory tree to be placed into a subdirectory rather than the root of the filesystem. This is intended for use with package management tools.

After installing, it is necessary to tune CLOGS for your OpenCL devices. See the next section for details.

2.4. Autotuning

CLOGS uses autotuning to choose good tuning parameters for each device. This helps with performance, but unfortunately also means that you need to run the autotuner before using CLOGS. This is done simply by running clogs-tune. You should do this on a system that is otherwise idle, to avoid skewing the results.

This can take a long time, particularly for CPU devices. If you have OpenCL CPU devices but don't intend to use them with CLOGS, you can save time by passing --cl-gpu to clogs-tune to only tune for GPU devices.

You need to do autotuning when you first install CLOGS or when you do an upgrade. You will also need to do autotuning when adding a new OpenCL device. The previous tuning is remembered, so this will only cause new configurations to be tuned. If for some reason you need to re-do tuning, you can pass --force to clogs-tune.

On POSIX systems, the results of autotuning are stored in $HOME/.clogs/cache. On Windows, they are stored in the local application data directory.

Chapter 3. Using CLOGS

3.1. Examples

We start with a simple example showing how to do a scan (prefix sum) operation. This operation is typically combined with an algorithm that produces a variable amount of output per work-item, to allocate positions in an output buffer. In a first pass, each work-item writes the number of output elements it will produce to a buffer, which we'll call counts. A prefix sum is then run over counts, which replaces each value with the sum of all values strictly before it. The second pass of the algorithm then uses the values in counts as the initial positions at which to start writing output to another buffer.

In the sample code, we assume that an OpenCL context, device and command queue have already been created using the C++ bindings, and that the counts have already been written to inBuffer.

#include <clogs/clogs.h>
clogs::Scan scanner(context, device, clogs::TYPE_UINT);
scanner.enqueue(queue, inBuffer, outBuffer, numElements);

The above code first allocates an object that can be used to perform scans. The constructor handles loading of the internal kernels and allocation of internal storage. These objects are quite slow to create and should be reused where possible.

The last line enqueues to queue the kernel launches needed to perform the scan. There are several optional parameters we have not shown. These include a vector of events to wait for, and an output parameter which returns an event that is signaled when the scan is complete. These work in the same way as the other enqueuing functions in the OpenCL C++ API, and allow scans to be combined with other kernel launches in a dependency graph.

This is an out-of-place scan, where the output is stored separately from the input. It is also possible to do a scan in-place by passing the same buffer for both the input and the output.
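The semantics of the scan, and the two-pass pattern described at the start of this section, can be sketched on the CPU as follows. This is a reference for what the operation computes, not for how CLOGS implements it. The functions exclusiveScanInPlace and expand are invented for the illustration, as is the toy workload in which item i emits repeat[i] copies of its own index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Replaces each value with the sum of all values strictly before it.
// Returns the sum of all inputs, i.e. the total number of outputs required.
unsigned exclusiveScanInPlace(std::vector<unsigned> &counts)
{
    unsigned sum = 0;
    for (std::size_t i = 0; i < counts.size(); i++)
    {
        const unsigned c = counts[i];
        counts[i] = sum;
        sum += c;
    }
    return sum;
}

// Toy variable-output algorithm: item i emits repeat[i] copies of i.
std::vector<unsigned> expand(const std::vector<unsigned> &repeat)
{
    // Pass 1: each item records how many outputs it will produce.
    std::vector<unsigned> counts(repeat);
    // Scan: counts[i] becomes the first output position owned by item i.
    const unsigned total = exclusiveScanInPlace(counts);
    // Pass 2: each item writes its outputs starting at its own position.
    std::vector<unsigned> out(total);
    for (std::size_t i = 0; i < repeat.size(); i++)
        for (unsigned j = 0; j < repeat[i]; j++)
            out[counts[i] + j] = static_cast<unsigned>(i);
    return out;
}
```

For the counts 3, 0, 2, 1 the scan produces 0, 3, 3, 5, and the total of 6 is the size needed for the output buffer. Because no two items are assigned overlapping ranges, the second pass can run fully in parallel on a device.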

Now let us look at a slightly more complex sorting example. Suppose we have unsigned 20-bit keys (stored in uints), and float4 values. In CLOGS, keys and values must be held in separate buffers (which reduces overall bandwidth), so let us say that keys are in keys and values in values. Furthermore, let us suppose that the vector wait contains a list of events for work that will generate the keys and values, and that we want an event that will be signaled when the sort is complete. Then the code we want is

#include <clogs/clogs.h>
clogs::Radixsort sorter(context, device, clogs::TYPE_UINT, clogs::Type(clogs::TYPE_FLOAT, 4));
sorter.enqueue(queue, keys, values, numElements, 20, &wait, &event);

Notice that we used clogs::Type to specify the type of the values. This is a C++ type that encodes an OpenCL built-in type. In this case we specified the base type (float) and vector length (4). We could also do this for the key type, but it was not necessary since there is an implicit conversion from clogs::TYPE_UINT to clogs::Type.

Also notice that we explicitly specified how many bits are used in the key. This parameter is optional (passing zero is the same as not passing it), but specifying it may reduce the number of passes and hence improve performance.
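To see why the number of key bits affects the number of passes, consider the following CPU-only sketch of a least-significant-digit radix sort with keys and values in separate arrays. It is not the CLOGS implementation: the function radixSort and the digit width RADIX_BITS (4 bits per pass) are assumptions made for the illustration, and the sketch assumes that all key bits at or above keyBits are zero. As in the text above, zero is treated as meaning all bits of the key.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical digit width for this sketch; CLOGS chooses its own internally.
const unsigned RADIX_BITS = 4;

// Stable LSD radix sort of keys, carrying values along with them.
// Returns the number of passes that were needed.
template<typename Value>
unsigned radixSort(std::vector<unsigned> &keys, std::vector<Value> &values,
                   unsigned keyBits)
{
    assert(keys.size() == values.size());
    assert(keyBits <= 32);
    if (keyBits == 0)
        keyBits = 32;   // zero means "use every bit of the key"
    const unsigned radix = 1u << RADIX_BITS;
    std::vector<unsigned> tmpKeys(keys.size());
    std::vector<Value> tmpValues(values.size());
    unsigned passes = 0;
    for (unsigned shift = 0; shift < keyBits; shift += RADIX_BITS, passes++)
    {
        // Count how many keys have each value of the current digit.
        std::vector<std::size_t> offsets(radix, 0);
        for (std::size_t i = 0; i < keys.size(); i++)
            offsets[(keys[i] >> shift) & (radix - 1)]++;
        // Exclusive scan: turn counts into the start position of each digit.
        std::size_t sum = 0;
        for (unsigned d = 0; d < radix; d++)
        {
            const std::size_t c = offsets[d];
            offsets[d] = sum;
            sum += c;
        }
        // Scatter in input order, which keeps each pass stable.
        for (std::size_t i = 0; i < keys.size(); i++)
        {
            const std::size_t pos = offsets[(keys[i] >> shift) & (radix - 1)]++;
            tmpKeys[pos] = keys[i];
            tmpValues[pos] = values[i];
        }
        keys.swap(tmpKeys);
        values.swap(tmpValues);
    }
    return passes;
}
```

With these assumptions, 20-bit keys need 5 passes where full 32-bit keys need 8, and every pass reads and writes all keys and values.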

Both of the classes we have covered so far have additional features, which are described in the reference manual.

3.2. Benchmark application

For a complete example of using the API, refer to tools/clogs-benchmark.cpp, which is a tool for benchmarking the performance of CLOGS. This tool is also installed when you install CLOGS, so you can obtain estimates of sorting performance from the command line.

3.3. Reentrance

The classes in this API (clogs::Scan and clogs::Radixsort) store internal state that is used by the enqueued work. There are two limitations on reentrance:

  1. It is not safe for two host threads to call the enqueue method on the same object at the same time.

  2. It is not safe for the work enqueued by two calls to enqueue on the same object to be executed at the same time. Thus, if two calls specify different command queues or specify the same queue but it is in out-of-order execution mode, then events or other synchronization primitives need to be used to ensure that the work does not overlap.

3.4. Error handling

The source of CLOGS is compiled with __CL_ENABLE_EXCEPTIONS defined, so any error caused internally will be thrown as a cl::Error. Additionally, the reference documentation lists some higher-level conditions which will be signaled by throwing cl::Error (for example, if numElements was too large and would have overflowed the buffer).

The other type of exception that may occur is clogs::InternalError. This is only thrown for unexpected errors with the implementation, such as when the source for one of the kernels fails to compile. If autotuning has not been done, a clogs::CacheError (a subclass of clogs::InternalError) is thrown.
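Because clogs::CacheError is a subclass of clogs::InternalError, a handler for it must appear before the handler for the base class, or it will never be reached. The sketch below shows that ordering using stand-in classes in a sketch namespace, so that it compiles without OpenCL or CLOGS. The stand-ins, their constructors, their std::runtime_error base, the Action enumeration and the classify function are all assumptions of this sketch; real code would catch clogs::CacheError, clogs::InternalError and cl::Error instead.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-ins for this sketch only; they mirror the hierarchy described above.
namespace sketch
{
struct Error : std::runtime_error            // plays the role of cl::Error
{
    explicit Error(const std::string &msg) : std::runtime_error(msg) {}
};
struct InternalError : std::runtime_error    // plays the role of clogs::InternalError
{
    explicit InternalError(const std::string &msg) : std::runtime_error(msg) {}
};
struct CacheError : InternalError            // plays the role of clogs::CacheError
{
    explicit CacheError(const std::string &msg) : InternalError(msg) {}
};
}

enum Action { SUCCESS, RUN_TUNER, REPORT_BUG, API_ERROR };

// Runs work() and maps each kind of exception to a suggested reaction.
template<typename F>
Action classify(F work)
{
    try
    {
        work();
        return SUCCESS;
    }
    catch (const sketch::CacheError &)       // most derived class first
    {
        return RUN_TUNER;                    // no tuning data: run clogs-tune
    }
    catch (const sketch::InternalError &)
    {
        return REPORT_BUG;                   // unexpected implementation failure
    }
    catch (const sketch::Error &)
    {
        return API_ERROR;                    // OpenCL error or invalid argument
    }
}
```

Swapping the first two handlers would cause a missing tuning cache to be reported as an implementation bug.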

3.5. Memory management

Each object allocated through the API allocates a small amount of OpenCL memory, whose size depends only on the arguments to the constructor. Additionally, clogs::Radixsort allocates temporary buffers as part of enqueue to hold partially-sorted copies of the keys and values.

For some uses, it is desirable to avoid repeatedly allocating and freeing these temporary buffers, and so it is possible for the user to specify buffers to use by calling setTemporaryBuffers (see the reference documentation for details).

The scanning and sorting objects are non-copyable, to avoid accidentally triggering expensive copies of OpenCL objects. They need to be passed by reference.

3.6. Profiling

The event returned by the various enqueue commands is suitable for event ordering, but it does not work well with OpenCL event profiling functions to determine how much time is spent on the GPU. For this purpose, one should call setEventCallback on the clogs::Radixsort or clogs::Scan object. The registered callback will be called once for each CL command enqueued, passing the associated event. Note that the callback is called during the enqueue call, rather than when the event completes; it is up to you to defer querying the profiling information until the event is complete.
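The record-now, query-later pattern described above might look like the following sketch. It deliberately avoids the real API so that it is self-contained: SketchEvent stands in for cl::Event, and the callback signature of recordEvent is hypothetical (consult the reference documentation for the actual signature expected by setEventCallback). With real cl::Event objects one would store copies of the event handles rather than pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for cl::Event, for this sketch only.
struct SketchEvent
{
    bool complete;                   // true once the command has finished
    unsigned long long start, end;   // device timestamps in nanoseconds
};

// Runs during enqueue, so it only records the event for later.
void recordEvent(const SketchEvent *event, void *userData)
{
    static_cast<std::vector<const SketchEvent *> *>(userData)->push_back(event);
}

// Call only once all the work has finished: sums the time of each command.
unsigned long long totalNanoseconds(const std::vector<const SketchEvent *> &events)
{
    unsigned long long total = 0;
    for (std::size_t i = 0; i < events.size(); i++)
    {
        assert(events[i]->complete);   // timestamps are not valid before this
        total += events[i]->end - events[i]->start;
    }
    return total;
}
```

In a real program the deferred step would typically follow a call that waits for the queue or for the final event, after which the profiling information of every recorded event can be queried.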

Chapter 4. Performance

While the code is heavily optimized, CLOGS has had relatively little device-specific performance tuning. The radix sort has been tuned on NVIDIA Fermi and AMD GCN architectures, and the scan has only been tuned on Fermi. It is not yet as fast as some CUDA sorting implementations. The graph below gives an indication of sorting rates. Note that the Y axis is the rate, not the time: the sort needs a lot of input to achieve maximum throughput.

Chapter 5. License

Copyright (c) 2012-2013 University of Cape Town

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.