Copyright © 2012, 2013, 2014 University of Cape Town
Copyright © 2014, 2015 Bruce Merry
CLOGS is a library for higher-level operations on top of the OpenCL C++ API. It is designed to integrate with other OpenCL code, including synchronization using OpenCL events.
Currently only three operations are supported: radix sorting, reduction, and exclusive scan. Radix sort supports all the unsigned integral types as keys, and all the built-in scalar and vector types suitable for storage in buffers as values. Scan supports all the integral types. It also supports vector types, which allows for limited multi-scan capabilities. Reduction supports all the built-in types, but the floating-point types are not tested.
At present CLOGS is only supported with GCC on GNU/Linux and with Visual C++ on Windows. The code for the library itself is portable, but some aspects of the build system and test suite require porting.
It has been tested on an NVIDIA GeForce GTX 480, an AMD R9 270 and on a CPU using the AMD APP SDK. The Intel SDK for OpenCL Applications is not supported as it has numerous bugs. Other OpenCL implementations are expected to work, but they are untested and unlikely to have optimal performance.
The following dependencies are required to build CLOGS. Where a minimum version is given, this is just the minimum version that is tested; older versions may still work.
An OpenCL implementation and headers, including a recent version of CL/cl.hpp. If you have compilation problems, it might be that your vendor SDK is providing an old version. You can download the latest from the registry. Note that you should use the version for the latest version of OpenCL (currently 1.2), even if you are targeting an older OpenCL version.
A C++ compiler. GCC 4.8 and Visual C++ 2013 have been tested.
The Boost headers. No dynamic libraries are needed to use the CLOGS library, but the boost::program_options library is needed for the CLOGS test suite and benchmark application. Boost headers are only required when building the library, not when building code against it.
Doxygen is required to build the reference documentation (optional).
xsltproc is required to build the user manual (optional). It is best to also install the DocBook XSL stylesheets so that they will be sourced locally rather than from the internet, which can be extremely slow.
Python 2.7.
CppUnit is required to build and run the test suite (optional).
CLOGS uses the waf build system. The build system is included in the distribution, so you do not need to download it separately.
The first step is to configure the build. This is done by running
$ python waf configure
This will check that the necessary headers can be found. OpenCL headers are not always installed in the normal include paths. You can explicitly specify the location for them by running
$ python waf configure --cl-headers=path
Note that path should be the directory containing the CL directory. If other libraries also need to be added to the include or link paths, you should use compiler-specific environment variables, such as INCLUDE and LIBPATH for MSVC or CPATH and LIBRARY_PATH for GCC.
You can control where CLOGS will be installed by using --prefix, as well as the other standard autoconf options (run waf --help for a list).
The configuration will automatically detect whether doxygen and xsltproc are present. However, you can disable them with --without-doxygen and --without-xsltproc.
If you need to debug into CLOGS, you can pass --variant=debug at configuration time to create a debug build, or --variant=optimized to create an optimized build with debug symbols.
Installation is done by running python waf install. Unless you have changed the installation directory, you will probably need to be root to do this, and you will also need to run ldconfig afterwards on a GNU system.
waf also supports a --destdir option that allows the entire directory tree to be placed into a subdirectory rather than the root of the filesystem. This is intended for use with package management tools.
After installing, it is necessary to tune CLOGS for your OpenCL devices. See the next section for details.
CLOGS uses autotuning to choose good tuning parameters for each device. In previous versions of CLOGS one had to run a separate clogs-tune command to do autotuning in advance, but current versions do autotuning on-demand.
On UNIX-like systems, the results of autotuning are stored in $HOME/.cache/clogs/cache.sqlite by default. On Windows, they are stored in the local application data directory. You can override the setting by setting CLOGS_CACHE_DIR, or on UNIX-like systems by setting XDG_CACHE_HOME. You can delete this database to force retuning, or to clean out stale tuning results (for example, for older versions of drivers) that are taking up space.
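As an illustration of how these settings could interact on a UNIX-like system, the following sketch resolves the database location from the environment. It is not taken from the CLOGS source. The precedence (CLOGS_CACHE_DIR first, then XDG_CACHE_HOME, then HOME), the assumption that CLOGS_CACHE_DIR names the directory that directly holds cache.sqlite, and the helper name cacheDatabasePath are all assumptions made for this example.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical helper, not part of the CLOGS API. The environment lookup is
// passed in so that the logic can be exercised without touching the real
// environment.
typedef std::function<const char *(const char *)> EnvLookup;

inline std::string cacheDatabasePath(const EnvLookup &env)
{
    assert(env); // a lookup function must be supplied

    // Assumption: CLOGS_CACHE_DIR is the directory containing the database.
    const char *dir = env("CLOGS_CACHE_DIR");
    if (dir && *dir)
        return std::string(dir) + "/cache.sqlite";

    // Assumption: XDG_CACHE_HOME replaces $HOME/.cache.
    const char *xdg = env("XDG_CACHE_HOME");
    if (xdg && *xdg)
        return std::string(xdg) + "/clogs/cache.sqlite";

    // Documented default on UNIX-like systems.
    const char *home = env("HOME");
    if (home && *home)
        return std::string(home) + "/.cache/clogs/cache.sqlite";

    return std::string(); // no usable location found
}
```

To query the real environment, pass a lambda that forwards to std::getenv (from <cstdlib>). Deleting the file named by the result would then force retuning, as described above.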
We start with a simple example showing how to do a scan (prefix sum) operation. This operation is typically combined with an algorithm that produces a variable amount of output per work item, to allocate positions in an output buffer. In a first pass, each work-item writes the number of output elements to a buffer, which we'll call counts. A prefix sum is then run over counts, which replaces each value with the sum of all values strictly before it. The second pass of the algorithm then uses the values in counts as the initial positions to start writing output to another buffer.
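To make the semantics concrete, here is a host-side reference sketch of the two-pass pattern in plain C++. It does not use CLOGS or OpenCL, and the function names exclusiveScan and compact are invented for this illustration. It only shows what an exclusive scan computes and how the scanned counts serve as output positions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Replace each element with the sum of all elements strictly before it
// (an exclusive prefix sum) and return the sum of the original values.
inline unsigned int exclusiveScan(std::vector<unsigned int> &counts)
{
    unsigned int running = 0;
    for (std::size_t i = 0; i < counts.size(); i++)
    {
        unsigned int count = counts[i];
        counts[i] = running; // start position for item i
        running += count;
    }
    return running;          // total number of output elements
}

// Second pass: item i writes perItem[i] outputs starting at its scanned
// offset. For the sake of the example, each output is just the index i.
inline std::vector<std::size_t> compact(const std::vector<unsigned int> &perItem)
{
    std::vector<unsigned int> offsets = perItem;
    unsigned int total = exclusiveScan(offsets);
    std::vector<std::size_t> output(total);
    for (std::size_t i = 0; i < perItem.size(); i++)
    {
        assert(offsets[i] + perItem[i] <= total);
        for (unsigned int j = 0; j < perItem[i]; j++)
            output[offsets[i] + j] = i;
    }
    return output;
}
```

For counts of {3, 0, 2, 1}, the scan produces {0, 3, 3, 5} and a total of 6, so the second pass fills a 6-element output buffer with no gaps or overlaps.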
In the sample code, we assume that an OpenCL context, device and command queue have already been created using the C++ bindings, and that the counts array has already been written.
    #include <clogs/clogs.h>
    ...
    clogs::ScanProblem problem;
    problem.setType(clogs::TYPE_UINT);
    clogs::Scan scanner(context, device, problem);
    ...
    scanner.enqueue(queue, inBuffer, outBuffer, numElements);
The above code starts by specifying the variation of the algorithm: in this case, the type of elements to scan will be cl_uint. It then creates an object (scanner) that can be used to perform scans. The constructor handles loading of the internal kernels and allocation of internal storage. These objects are quite slow to create and should be reused where possible.
The last line enqueues to queue the kernel launches needed to perform the scan. There are several optional parameters we have not shown. These include a vector of events to wait for, and an output parameter which returns an event that is signaled when the scan is complete. These work in the same way as the other enqueuing functions in the OpenCL C++ API, and allow scans to be combined with other kernel launches in a dependency graph.
This is an out-of-place scan, where the output is stored separately from the input. It is also possible to do a scan in-place by passing the same value for both buffers.
Now let us look at a slightly more complex sorting example. Suppose we have unsigned 20-bit keys (stored in uints), and float4 values. In CLOGS, keys and values must be held in separate buffers (which reduces overall bandwidth), so let us say that keys are in keys and values in values. Furthermore, let us suppose that the vector wait contains a list of events for work that will generate the keys and values, and that we want an event that will be signaled when the sort is complete. Then the code we want is
    #include <clogs/clogs.h>
    ...
    clogs::RadixsortProblem problem;
    problem.setKeyType(clogs::TYPE_UINT);
    problem.setValueType(clogs::Type(clogs::TYPE_FLOAT, 4));
    clogs::Radixsort sorter(context, device, problem);
    sorter.enqueue(queue, keys, values, numElements, 20, &wait, &event);
Notice that we used clogs::Type to specify the type of the values. This is a C++ type that encodes an OpenCL built-in type. In this case we specified the base type (float) and vector length (4). We could also do this for the key type, but it was not necessary since there is an implicit conversion from clogs::TYPE_UINT to clogs::Type.
Also notice that we explicitly specified how many bits are used in the key. This parameter is optional (passing zero is the same as not passing it), but specifying it may reduce the number of passes and hence improve performance.
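The effect of the key-bits parameter can be seen in a small host-side sketch of a least-significant-digit radix sort over separate key and value arrays. This is a plain C++ illustration, not the CLOGS implementation: the function name radixSortPairs is invented, and the choice of 4 bits per pass is an arbitrary assumption that need not match what CLOGS uses on any device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bits consumed per pass: an assumption made for this sketch only.
const unsigned int RADIX_BITS = 4;

// Sort keys, and move the matching values with them, considering only the
// low maxBits bits of each key. Keys must have no bits set at or above
// maxBits. Returns the number of passes performed.
template<typename Value>
unsigned int radixSortPairs(std::vector<std::uint32_t> &keys,
                            std::vector<Value> &values,
                            unsigned int maxBits = 32)
{
    assert(keys.size() == values.size());
    assert(maxBits >= 1 && maxBits <= 32);
    const std::size_t buckets = std::size_t(1) << RADIX_BITS;
    std::vector<std::uint32_t> tmpKeys(keys.size());
    std::vector<Value> tmpValues(values.size());
    unsigned int passes = 0;
    for (unsigned int shift = 0; shift < maxBits; shift += RADIX_BITS)
    {
        // Histogram of the current digit.
        std::vector<std::size_t> offsets(buckets, 0);
        for (std::size_t i = 0; i < keys.size(); i++)
            offsets[(keys[i] >> shift) & (buckets - 1)]++;
        // An exclusive scan turns the counts into start positions.
        std::size_t running = 0;
        for (std::size_t b = 0; b < buckets; b++)
        {
            std::size_t count = offsets[b];
            offsets[b] = running;
            running += count;
        }
        // Stable scatter of keys and values into the temporary arrays.
        for (std::size_t i = 0; i < keys.size(); i++)
        {
            std::size_t pos = offsets[(keys[i] >> shift) & (buckets - 1)]++;
            tmpKeys[pos] = keys[i];
            tmpValues[pos] = values[i];
        }
        keys.swap(tmpKeys);
        values.swap(tmpValues);
        passes++;
    }
    return passes;
}
```

With these assumptions, sorting 20-bit keys takes 5 passes when maxBits is 20, compared with 8 passes when all 32 bits are processed. The temporary arrays in the sketch play the same role as the partially-sorted copies of keys and values that a real implementation needs.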
Both of the classes we have covered so far have additional features, which are described in the reference manual.
For a complete example of using the API, refer to tools/clogs-benchmark.cpp, which is a tool for benchmarking the performance of CLOGS. This tool is also installed when you install CLOGS, so you can obtain estimates of sorting performance from the command-line.
The classes in this API (clogs::Scan, clogs::Reduce and clogs::Radixsort) store internal state that is used by the enqueued work. There are two limitations on reentrance:
It is not safe for two host threads to call the enqueue method on the same object at the same time.
It is not safe for the work enqueued by two calls to enqueue on the same object to be executed at the same time. Thus, if two calls specify different command queues or specify the same queue but it is in out-of-order execution mode, then events or other synchronization primitives need to be used to ensure that the work does not overlap.
The examples all show the interfaces using the classes defined in the OpenCL C++ bindings, such as cl::CommandQueue. However, interfaces are also provided that use the plain C bindings, e.g., cl_command_queue. Using the C bindings may be very slightly faster (on the CPU) since it avoids some extra reference counting.
Errors are reported exclusively via exceptions. OpenCL errors are reported with exceptions of type clogs::Error. If __CL_ENABLE_EXCEPTIONS is defined, then this is a typedef of cl::Error; otherwise, it is a type with the same interface, which inherits from std::runtime_error. Additionally, the reference documentation lists some higher-level conditions which will be signaled by throwing clogs::Error (for example, if numElements was too large and would have overflowed the buffer).
Out-of-memory conditions may be reported either with clogs::Error (if it is the OpenCL implementation that failed to allocate memory) or std::bad_alloc (if it is CLOGS that fails to allocate memory).
The other type of exception that may occur is clogs::InternalError. This is only thrown for unexpected errors with the implementation, such as when the source for one of the kernels fails to compile. Errors related to the tuning cache are reported as clogs::CacheError (a subclass of clogs::InternalError).
Each object allocated through the API allocates a small amount of OpenCL memory, whose size depends only on the arguments to the constructor. Additionally, clogs::Radixsort allocates temporary buffers as part of enqueue to hold partially-sorted copies of the keys and values.
For some uses, it is desirable to avoid repeatedly allocating and freeing these temporary buffers, and so it is possible for the user to specify buffers to use by calling setTemporaryBuffers (see the reference documentation for details).
The algorithm objects are non-copyable, to avoid accidentally triggering expensive copies of OpenCL objects. However, they are default-constructible, swappable, and (if a C++11 compiler is used) movable.
The event returned by the various enqueue commands is suitable for event ordering, but it does not work well with OpenCL event profiling functions to determine how much time is spent on the GPU. For this purpose, one should call setEventCallback on the clogs::Radixsort or similar object. The registered callback will be called once for each CL command enqueued, passing the associated event. Note that the callback is called during the enqueue call, rather than when the event completes; it is up to you to defer querying the profiling information until the event is complete.
By default, tuning progress is reported to standard output. In some cases one might want to redirect the output (for example, to a log file) or suppress it entirely. One can control the output stream and the verbosity using a clogs::TunePolicy. Here is an example:
    #include <iostream>
    #include <clogs/clogs.h>
    ...
    clogs::TunePolicy policy;
    policy.setOutput(std::cerr);
    policy.setVerbosity(clogs::TUNE_VERBOSITY_TERSE);
    clogs::RadixsortProblem problem;
    problem.setTunePolicy(policy);
    ...
Refer to the reference manual for the possible verbosity levels. If more control is required over how the progress report is handled, you can implement a custom output stream type (Boost.Iostreams greatly simplifies this).
It is also possible to disable on-the-fly tuning, by calling setEnable on the policy. If the problem configuration has not already been tuned, attempting to construct the algorithm object will throw a clogs::CacheError instead of doing tuning.
While the code is heavily optimized, CLOGS has had relatively little device-specific performance tuning. The radix sort has been tuned on NVIDIA Fermi and GCN architectures, and the scan has only been tuned on Fermi. It is not yet as fast as some CUDA implementations. The graph below gives an indication of sorting rates. Note that the Y axis is the rate, not the time: it needs a lot of input to achieve maximum throughput.
Copyright (c) 2012-2014 University of Cape Town
Copyright (c) 2014 Bruce Merry
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
This software includes SQLite 3.8.5, which is in the public domain. For details, refer to the website.