CLOGS User Manual

Table of Contents

1. Introduction
2. Installation
2.1. Requirements
2.2. Compiling CLOGS
2.3. Installing CLOGS
2.4. Autotuning
3. Using CLOGS
3.1. Examples
3.2. Benchmark application
3.3. Reentrance
3.4. Avoiding the OpenCL C++ bindings
3.5. Error handling
3.6. Memory management
3.7. Profiling
3.8. Tuning policy
4. Performance
5. License
5.1. SQLite 3

Chapter 1. Introduction

CLOGS is a library for higher-level operations on top of the OpenCL C++ API. It is designed to integrate with other OpenCL code, including synchronization using OpenCL events.

Currently only three operations are supported: radix sorting, reduction, and exclusive scan. Radix sort supports all the unsigned integral types as keys, and all the built-in scalar and vector types suitable for storage in buffers as values. Scan supports all the integral types. It also supports vector types, which allows for limited multi-scan capabilities. Reduction supports all the built-in types, but the floating-point types are not tested.

Chapter 2. Installation

2.1. Requirements

At present CLOGS is only supported with GCC on GNU/Linux and with Visual C++ on Windows. The code for the library itself is portable, but some aspects of the build system and test suite require porting.

It has been tested on an NVIDIA GeForce GTX 480, an AMD R9 270 and on a CPU using the AMD APP SDK. The Intel SDK for OpenCL Applications is not supported as it has numerous bugs. Other OpenCL implementations are expected to work, but they are untested and unlikely to have optimal performance.

The following dependencies are required to build CLOGS. Where a minimum version is given, this is just the minimum version that is tested; older versions may still work.

  • An OpenCL implementation and headers, including a recent version of CL/cl.hpp. If you have compilation problems, it might be that your vendor SDK is providing an old version. You can download the latest from the Khronos OpenCL registry. Note that you should use the header for the latest version of OpenCL (currently 1.2), even if you are targeting an older OpenCL version.

  • A C++ compiler. GCC 4.8 and Visual C++ 2013 have been tested.

  • The Boost headers. No dynamic libraries are needed to use the CLOGS library, but the boost::program_options library is needed for the CLOGS test suite and benchmark application. Boost headers are only required when building the library, not when building code against it.

  • Doxygen is required to build the reference documentation (optional).

  • xsltproc is required to build the user manual (optional). It is best to also install the DocBook XSL stylesheets so that they will be sourced locally rather than from the internet, which can be extremely slow.

  • Python 2.7.

  • CppUnit is required to build and run the test suite (optional).

2.2. Compiling CLOGS

CLOGS uses the waf build system. The build system is included in the distribution, so you do not need to download it separately.

The first step is to configure the build. This is done by running

$ python waf configure

This will check that the necessary headers can be found. OpenCL headers are not always installed in the normal include paths. You can explicitly specify the location for them by running

$ python waf configure --cl-headers=path

Note that path should be the directory containing the CL directory. If other libraries also need to be added to the include or link paths, you should use compiler-specific environment variables, such as INCLUDE and LIBPATH for MSVC or CPATH and LIBRARY_PATH for GCC.

You can control where CLOGS will be installed by using --prefix, as well as the other standard autoconf options (run python waf --help for a list).

The configuration will automatically detect whether doxygen and xsltproc are present. However, you can disable them with --without-doxygen and --without-xsltproc.

If you need to debug into CLOGS, you can pass --variant=debug at configuration time to create a debug build, or --variant=optimized to create an optimized build with debug symbols.

2.3. Installing CLOGS

Installation is done by running python waf install. Unless you have changed the installation directory, you will probably need to be root to do this, and you will also need to run ldconfig afterwards on a GNU system.

waf also supports a --destdir option that allows the entire directory tree to be placed into a subdirectory rather than the root of the filesystem. This is intended for use with package management tools.

After installing, it is necessary to tune CLOGS for your OpenCL devices. See the next section for details.

2.4. Autotuning

CLOGS uses autotuning to choose good tuning parameters for each device. In previous versions of CLOGS one had to run a separate clogs-tune command to do autotuning in advance, but current versions do autotuning on-demand.

On UNIX-like systems, the results of autotuning are stored in $HOME/.cache/clogs/cache.sqlite by default. On Windows, they are stored in the local application data directory. You can override this location by setting the CLOGS_CACHE_DIR environment variable, or on UNIX-like systems by setting XDG_CACHE_HOME. You can delete this database to force retuning, or to clean out stale tuning results (for example, from older driver versions) that are taking up space.
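The lookup on UNIX-like systems can be sketched roughly as follows. This is an illustration only, not code from CLOGS: the helper name guessCacheFile is hypothetical, and both the precedence of CLOGS_CACHE_DIR over XDG_CACHE_HOME and the assumption that cache.sqlite sits directly inside CLOGS_CACHE_DIR are guesses.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of CLOGS): guess where the tuning cache
// lives on a UNIX-like system from the environment variables named above.
// Assumed precedence: CLOGS_CACHE_DIR, then XDG_CACHE_HOME, then $HOME/.cache.
std::string guessCacheFile()
{
    const char *dir = std::getenv("CLOGS_CACHE_DIR");
    if (dir != nullptr && *dir != '\0')
        return std::string(dir) + "/cache.sqlite";      // assumed layout

    const char *xdg = std::getenv("XDG_CACHE_HOME");
    if (xdg != nullptr && *xdg != '\0')
        return std::string(xdg) + "/clogs/cache.sqlite";

    const char *home = std::getenv("HOME");
    std::string base = (home != nullptr) ? home : "";
    return base + "/.cache/clogs/cache.sqlite";          // documented default
}
```

Printing the result of guessCacheFile() is a quick way to find the file to delete when you want to force retuning, provided the assumptions above hold on your system.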

Chapter 3. Using CLOGS

3.1. Examples

We start with a simple example showing how to do a scan (prefix sum) operation. This operation is typically combined with an algorithm that produces a variable amount of output per work item, to allocate positions in an output buffer. In a first pass, each work-item writes the number of output elements to a buffer, which we'll call counts. A prefix sum is then run over counts, which replaces each value with the sum of all values strictly before it. The second pass of the algorithm then uses the values in counts as the initial positions to start writing output to another buffer.

In the sample code, we assume that an OpenCL context, device and command queue have already been created using the C++ bindings, and that the counts array has already been written.

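#include <clogs/clogs.h>
clogs::ScanProblem problem;
problem.setType(clogs::TYPE_UINT);              // elements to scan are cl_uint
clogs::Scan scanner(context, device, problem);  // slow to construct, so reuse it
scanner.enqueue(queue, inBuffer, outBuffer, numElements);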

The above code starts by specifying the variation of the algorithm: in this case, the type of elements to scan will be cl_uint. It then creates an object (scanner) that can be used to perform scans. The constructor handles loading of the internal kernels and allocation of internal storage. These objects are quite slow to create and should be reused where possible.

The last line enqueues to queue the kernel launches needed to perform the scan. There are several optional parameters we have not shown. These include a vector of events to wait for, and an output parameter which returns an event that is signaled when the scan is complete. These work in the same way as the other enqueuing functions in the OpenCL C++ API, and allow scans to be combined with other kernel launches in a dependency graph.

This is an out-of-place scan, where the output is stored separately from the input. It is also possible to do a scan in-place by passing the same buffer for both input and output.
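To make the semantics of the exclusive scan concrete, here is a small CPU reference. It is a sketch for checking results on the host, not the CLOGS implementation, and the function name exclusiveScan is ours.

```cpp
#include <cstddef>
#include <vector>

// CPU reference for an exclusive scan (prefix sum): out[i] is the sum of
// in[0] .. in[i-1], so out[0] is always 0. Not the CLOGS implementation.
std::vector<unsigned int> exclusiveScan(const std::vector<unsigned int> &in)
{
    std::vector<unsigned int> out(in.size());
    unsigned int sum = 0;
    for (std::size_t i = 0; i < in.size(); i++)
    {
        out[i] = sum;      // sum of everything strictly before element i
        sum += in[i];
    }
    return out;
}
```

For counts of {3, 0, 2, 5} this gives offsets {0, 3, 3, 5}. The total number of output elements is the last offset plus the last count, here 10.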

Now let us look at a slightly more complex sorting example. Suppose we have unsigned 20-bit keys (stored in uints), and float4 values. In CLOGS, keys and values must be held in separate buffers (which reduces overall bandwidth), so let us say that keys are in keys and values in values. Furthermore, let us suppose that the vector wait contains a list of events for work that will generate the keys and values, and that we want an event that will be signaled when the sort is complete. Then the code we want is

#include <clogs/clogs.h>
clogs::RadixsortProblem problem;
problem.setKeyType(clogs::TYPE_UINT);                      // keys are uints
problem.setValueType(clogs::Type(clogs::TYPE_FLOAT, 4));   // values are float4
clogs::Radixsort sorter(context, device, problem);
sorter.enqueue(queue, keys, values, numElements, 20, &wait, &event);

Notice that we used clogs::Type to specify the type of the values. This is a C++ type that encodes an OpenCL built-in type. In this case we specified the base type (float) and vector length (4). We could also do this for the key type, but it was not necessary since there is an implicit conversion from clogs::TYPE_UINT to clogs::Type.

Also notice that we explicitly specified how many bits are used in the key. This parameter is optional (passing zero is the same as not passing it), but specifying it may reduce the number of passes and hence improve performance.
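The effect of the key width on the number of passes can be seen in a small CPU sketch of a least-significant-digit radix sort. This is not how CLOGS is implemented: the function name radixSortSketch and the 4-bit digit width RADIX_BITS are hypothetical choices for illustration, and the digit width CLOGS actually uses is not stated in this manual.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical digit width, chosen only for this illustration.
const unsigned int RADIX_BITS = 4;

// CPU sketch of a least-significant-digit radix sort over separate key and
// value arrays (which must have the same length). Only the low maxBits bits
// of each key are examined. Returns the number of passes performed.
unsigned int radixSortSketch(std::vector<std::uint32_t> &keys,
                             std::vector<float> &values,
                             unsigned int maxBits)
{
    const std::size_t n = keys.size();
    const unsigned int buckets = 1u << RADIX_BITS;
    std::vector<std::uint32_t> tmpKeys(n);
    std::vector<float> tmpValues(n);
    unsigned int passes = 0;

    for (unsigned int shift = 0; shift < maxBits; shift += RADIX_BITS)
    {
        // Histogram of the current digit, then an exclusive scan of the
        // histogram to find the first output slot for each digit.
        std::vector<std::size_t> offsets(buckets, 0);
        for (std::size_t i = 0; i < n; i++)
            offsets[(keys[i] >> shift) & (buckets - 1)]++;
        std::size_t sum = 0;
        for (unsigned int d = 0; d < buckets; d++)
        {
            std::size_t count = offsets[d];
            offsets[d] = sum;
            sum += count;
        }
        // Stable scatter of the keys together with their values.
        for (std::size_t i = 0; i < n; i++)
        {
            std::size_t pos = offsets[(keys[i] >> shift) & (buckets - 1)]++;
            tmpKeys[pos] = keys[i];
            tmpValues[pos] = values[i];
        }
        keys.swap(tmpKeys);
        values.swap(tmpValues);
        passes++;
    }
    return passes;
}
```

With this digit width, 20-bit keys need 5 passes while full 32-bit keys need 8, which is the kind of saving the optional parameter makes possible.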

Both of the classes we have covered so far have additional features, which are described in the reference manual.

3.2. Benchmark application

For a complete example of using the API, refer to tools/clogs-benchmark.cpp, which is a tool for benchmarking the performance of CLOGS. This tool is also installed when you install CLOGS, so you can obtain estimates of sorting performance from the command-line.

3.3. Reentrance

The classes in this API (clogs::Scan, clogs::Reduce and clogs::Radixsort) store internal state that is used by the enqueued work. There are two limitations on reentrance:

  1. It is not safe for two host threads to call the enqueue method on the same object at the same time.

  2. It is not safe for the work enqueued by two calls to enqueue on the same object to be executed at the same time. Thus, if two calls specify different command queues or specify the same queue but it is in out-of-order execution mode, then events or other synchronization primitives need to be used to ensure that the work does not overlap.
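One way to respect the first limitation when several host threads share an object is to serialize the calls with a mutex. The wrapper below is a hypothetical sketch and not part of CLOGS; Algo stands in for a type such as clogs::Scan. It does nothing for the second limitation, so overlap of the enqueued device work must still be prevented with events or an in-order queue.

```cpp
#include <mutex>
#include <utility>

// Hypothetical wrapper (not part of CLOGS): serializes host-side calls to
// enqueue() on one shared algorithm object.
template<typename Algo>
class SerializedEnqueue
{
public:
    explicit SerializedEnqueue(Algo &algo) : algo_(algo) {}

    // Forwards all arguments to algo.enqueue() while holding the lock, so
    // that two host threads never run enqueue() on the object concurrently.
    template<typename... Args>
    void enqueue(Args &&... args)
    {
        std::lock_guard<std::mutex> guard(lock_);
        algo_.enqueue(std::forward<Args>(args)...);
    }

private:
    Algo &algo_;
    std::mutex lock_;
};
```

An alternative that avoids locking altogether is to give each host thread its own algorithm object, at the cost of the extra construction time and memory.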

3.4. Avoiding the OpenCL C++ bindings

The examples all show the interfaces using the classes defined in the OpenCL C++ bindings, such as cl::CommandQueue. However, interfaces are also provided that use the plain C types (e.g., cl_command_queue). Using the C interfaces may be very slightly faster (on the CPU) since it avoids some extra reference counting.

3.5. Error handling

Errors are reported exclusively via exceptions. OpenCL errors are reported with exceptions of type clogs::Error. If __CL_ENABLE_EXCEPTIONS is defined, then this is a typedef of cl::Error; otherwise, it is a type with the same interface, which inherits from std::runtime_error. Additionally, the reference documentation lists some higher-level conditions which will be signaled by throwing clogs::Error (for example, if numElements was too large and would have overflowed the buffer).

Out-of-memory conditions may be reported either with clogs::Error (if it is the OpenCL implementation that failed to allocate memory) or std::bad_alloc (if it is CLOGS that fails to allocate memory).

The other type of exception that may occur is clogs::InternalError. This is only thrown for unexpected errors with the implementation, such as when the source for one of the kernels fails to compile. Errors related to the tuning cache are reported as clogs::CacheError (a subclass of clogs::InternalError).
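Because clogs::CacheError is a subclass of clogs::InternalError, handlers should be ordered from most to least derived. The sketch below uses stand-in types in a sketch namespace so that it compiles without CLOGS or OpenCL. The base class chosen for the InternalError stand-in is an assumption, and classify is a hypothetical helper. With CLOGS installed you would catch the clogs:: types instead.

```cpp
#include <new>
#include <stdexcept>
#include <string>

// Stand-ins mirroring the relationships described above.
namespace sketch
{
struct Error : std::runtime_error           // role of clogs::Error
{
    explicit Error(const std::string &msg) : std::runtime_error(msg) {}
};
struct InternalError : std::runtime_error   // role of clogs::InternalError
{
    explicit InternalError(const std::string &msg) : std::runtime_error(msg) {}
};
struct CacheError : InternalError           // subclass, as in the text
{
    explicit CacheError(const std::string &msg) : InternalError(msg) {}
};
}

// Runs op() and classifies what it throws. The most derived type is caught
// first, since a CacheError would also match the InternalError handler.
// Any other exception type propagates to the caller.
template<typename Op>
std::string classify(Op op)
{
    try
    {
        op();
        return "ok";
    }
    catch (const sketch::CacheError &) { return "tuning cache problem"; }
    catch (const sketch::InternalError &) { return "internal error"; }
    catch (const sketch::Error &) { return "OpenCL error"; }
    catch (const std::bad_alloc &) { return "host out of memory"; }
}
```

If the CacheError handler were placed after the InternalError handler, it would never run.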

3.6. Memory management

Each object allocated through the API allocates a small amount of OpenCL memory, whose size depends only on the arguments to the constructor. Additionally, clogs::Radixsort allocates temporary buffers as part of enqueue to hold partially-sorted copies of the keys and values.

For some uses, it is desirable to avoid repeatedly allocating and freeing these temporary buffers, and so it is possible for the user to specify buffers to use by calling setTemporaryBuffers (see the reference documentation for details).

The algorithm objects are non-copyable, to avoid accidentally triggering expensive copies of OpenCL objects. However, they are default-constructible, swappable, and (if a C++11 compiler is used) movable.

3.7. Profiling

The event returned by the various enqueue commands is suitable for event ordering, but it does not work well with OpenCL event profiling functions to determine how much time is spent on the GPU. For this purpose, one should call setEventCallback on the clogs::Radixsort or similar object. The registered callback will be called once for each CL command enqueued, passing the associated event. Note that the callback is called during the enqueue call, rather than when the event completes; it is up to you to defer querying the profiling information until the event is complete.
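The "record now, query later" pattern can be sketched as follows. FakeEvent and EventLog are hypothetical stand-ins so that the sketch compiles without OpenCL. With the real API, the callback would store the OpenCL event it is given and the times would come from that event's profiling information; the exact callback signature is given in the reference documentation and is not reproduced here.

```cpp
#include <vector>

// Stand-in for an OpenCL event with profiling data attached.
struct FakeEvent
{
    unsigned long long startNs;
    unsigned long long endNs;
};

// Collects events as they are reported and sums their durations later.
class EventLog
{
public:
    // Body of a hypothetical callback: it only stores the event. Profiling
    // data must not be queried here, because the callback runs during the
    // enqueue call, before the command has executed.
    void record(const FakeEvent &event) { events.push_back(event); }

    // Call only once the work is known to be complete, for example after
    // waiting on the event returned by enqueue.
    unsigned long long totalNs() const
    {
        unsigned long long total = 0;
        for (const FakeEvent &e : events)
            total += e.endNs - e.startNs;
        return total;
    }

private:
    std::vector<FakeEvent> events;
};
```

Note that summing per-command durations gives the total device time only if the commands do not overlap.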

3.8. Tuning policy

By default, tuning progress is reported to standard output. In some cases one might want to redirect the output (for example, to a log file) or suppress it entirely. One can control the output stream and the verbosity using a clogs::TunePolicy. Here is an example:

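#include <clogs/clogs.h>
clogs::TunePolicy policy;
// ... set the output stream and verbosity on policy (see the reference manual) ...
clogs::RadixsortProblem problem;
// ... attach policy to problem before constructing the algorithm object ...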

Refer to the reference manual for the possible verbosity levels. If more control is required over how the progress report is handled, you can implement a custom output stream type (Boost.Iostreams greatly simplifies this).

It is also possible to disable on-the-fly tuning, by calling setEnable on the policy. If the problem configuration has not already been tuned, attempting to construct the algorithm object will throw a clogs::CacheError instead of doing tuning.

Chapter 4. Performance

While the code is heavily optimized, CLOGS has had relatively little device-specific performance tuning. The radix sort has been tuned on NVIDIA Fermi and GCN architectures, and the scan has only been tuned on Fermi. It is not yet as fast as some CUDA implementations. The graph below gives an indication of sorting rates. Note that the Y axis is the rate, not the time: it needs a lot of input to achieve maximum throughput.

Chapter 5. License

Copyright (c) 2012-2014 University of Cape Town
Copyright (c) 2014 Bruce Merry

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.


5.1. SQLite 3

This software includes SQLite 3.8.5, which is in the public domain. For details, refer to the website.