// NetBeans OpenCL Pack

Since I am doing a lot with OpenCL lately I decided to try to improve the tooling around OpenCL a bit. A weekend later the NetBeans OpenCL Pack was born :).

Features Including:

  • OpenCL Editor with syntax highlighting code completion and CL reference pages integration
  • OpenCL compiler integration
  • In-editor annotations of compiler warnings and errors updated as you type
  • JOCL project template

Technical Details:

The editor uses ANTLR as parser and lexer. This allows such simple things like keyword highlighting and also more complex features like semantic highlighting, formatting and auto completion (formatting is not yet implemented). It can also detect and report syntax errors, however this feature is automatically disabled if an OpenCL compiler is present on the host system. All with help of JOCL detected OpenCL implementations can be used as compiler backend.

Instead of using the old OpenGL Pack as template I decided to write it from scratch using latest NetBeans 7 and Java 7 APIs. So you will have to start NB with JDK7 to be able to use it.

Download

you can download it from the NetBeans plugin portal [mirror], sourcecode is on github

feedback and/or contributions/bugreports are as always appreciated

Screenshots:

auto completion editor project templates

have fun!


// Many little improvements made it into JOCL recently

Ok some of them are big, but I will only cover the little things with this blog entry :).

CLKernel

I added multiple utility methods to CLKernel and related classes. It is for example now possible to create a kernel and set its arguments in one line.


CLKernel sha512 = program.createCLKernel("sha512", padBuffer, digestBuffer, rangeBuffer);

Thanks to feedback in the jocl forums I also added methods to set vector typed arguments directly. In past you could do this only by setting them via a java.util.Buffer.


kernel.setArg(index, x, y, z, w);

Another small feature of CLKernel is to enforce 32bit arguments. You may want to switch between single and double floatingpoint precision at runtime or mix between both to improve performance you will have to compile the program with the double FP extension enabled. By setting kernel.setForce32bitArgs(true) all java doubles used as kernel arguments will be automatically cast down to 32bit CL floats (see MultiDeviceFractal demo for a example). This is nothing special but might safe you several if(single){setArg((float)foo)}else{setArg(foo)} constructs.

CLWork

CLKernel still only represents the function in the OpenCL program you want to call - nothing more. The new CLWork object contains everything required for kernel execution, like the NDRange and the kernel itself.


    int size = buffer.getNIOCapacity();
    CLWork1D work = CLWork.create1D(program.createCLKernel("sum", buffer, size));
    work.setWorkSize(size, 1).optimizeFor(device);

    // execute
    queue.putWriteBuffer(buffer, false)
         .putWork(work)
         .putReadBuffer(buffer, true);

optimizeFor(device) adjusts the workgroup size to meet device specific recommended values. This should make sure that all computing units of your GPU are used by dividing the work into groups (however this only works if your task does not care about the workgroup size, see javadoc).

CLSubDevice

Sometimes you don't want to put your CLDevice under 100% load. This might be the case for example if your device is the CPU your application is running on or if you have to share the GPU with an OpenGL context for rendering. One easy way of controlling device load is to limit the amount of compute units used for a task.


    CLPlatform platform = CLPlatform.getDefault(version(CL_1_1), type(CPU));

    CLDevice devices = platform.getMaxFLOPSDevice(type(CPU));
    CLSubDevice[] subs = device.createSubDevicesByCount(4, 4);
    // array contains now two virtual devices containing four CPU cores each

    CLContext context = CLContext.create(subs);
    CLCommandQueue queue = subs[0].createCommandQueue();
    ...

CLSubDevices extends CLDevice and can be used for context creation, queue creation and everywhere you would use the CLDevice. Prior to creating subdevices you should check if device.isFissionSupported() returns true.

CLProgram builder

Ok, this utility is not that new but I haven't blogged about it yet. If program.build() isn't enough you should take a look at the program builder. CLBuildConfiguration stores everything which is needed for program compilation and is easily configurable via the builder pattern :).


        // reusable builder
        CLBuildConfiguration builder = CLProgramBuilder.createConfiguration()
                                     .withOption(ENABLE_MAD)
                                     .forDevices(context.getDevices())
                                     .withDefine("RADIUS", 5)
                                     .withDefine("ENABLE_FOOBAR");
        builder.build(programA);
        builder.build(programB);
        ...

CLBuildConfiguration is fully reusable and can be upgraded to CLProgramConfiguration if you combine it with a CLProgram. Both can be serialised which allows to store the build configuration or the entire prebuild program on disc or send it over the network. (caching binaries on disc can safe startup time for example)


        // program configuration
        ois = new ObjectInputStream(new FileInputStream(file));
        CLProgramConfiguration programConfig = CLProgramBuilder.loadConfiguration(ois, context);
        assertNotNull(programConfig.getProgram());
        ois.close();
        program = programConfig.build(); // builds from source or loads binaries if possible
        assertTrue(program.isExecutable());

Note: loading binaries and associating them with the right driver/device is currently not trivial with OpenCL. Even if everything works as intended it is still possible that the driver refuses the binaries for some reason (driver update...etc). Thats why its recommended to add the program source to the configuration before calling build() to allow a automatic rebuild as fallback.


        // another entry point for complex builds (prepare() returns CLProgramConfiguration)
        program.prepare().withOption(ENABLE_MAD).forDevice(context.getMaxFlopsDevice()).build();

(all snippets have been stolen from the junit tests)
I am sure I forgot something... but this should cover at least some of the incremental improvements. Expect a few more blog entries for the larger features soon.

- - - - - -
In other news: Nvidia released OpenCL 1.1 drivers, some of us thought this would never happen -> all major vendors (AMD, Intel, NV, IBM, ZiiLABS ..) support now OpenCL 1.1 (screenshot)

have fun!


// Developing with JOCL on AMD, Intel and Nvidia OpenCL platforms

One nice feature of OpenCL is that the platform abstraction was handled in the spec from the first day on. You can install all OpenCL drivers side by side and let the application choose at runtime, on which device and on which platform it should execute the kernels.

As of today there are three four vendors which provide OpenCL implementations for the desktop. AMD and Intel support the OpenCL 1.1 specification where Nvidia apparently tries to stick with 1.0 to encourage their customers to stick with CUDA ;-). [edit] And of course there is also Apple providing out-of-the box OpenCL 1.0 support in MacOS 10.6.

JOCL contains a small CLInfo utility which can be used to quickly verify OpenCL installations. Here is the output on my system (ubuntu was booted) having all three SDKs installed:

CL_PLATFORM_NAMEATI StreamNVIDIA CUDAIntel(R) OpenCL
CL_PLATFORM_VERSIONOpenCL 1.1 ATI-Stream-v2.2 (302)OpenCL 1.0 CUDA 4.0.1OpenCL 1.1 LINUX
CL_PLATFORM_PROFILEFULL_PROFILEFULL_PROFILEFULL_PROFILE
CL_PLATFORM_VENDORAdvanced Micro Devices, Inc.NVIDIA CorporationIntel(R) Corporation
CL_PLATFORM_ICD_SUFFIX_KHRAMDNVIntel
CL_PLATFORM_EXTENSIONS[cl_khr_icd, cl_amd_event_callback][cl_khr_icd, cl_khr_byte_addressable_store, cl_nv_compiler_options, cl_nv_pragma_unroll, cl_nv_device_attribute_query, cl_khr_gl_sharing][cl_khr_icd, cl_khr_byte_addressable_store, cl_khr_fp64, cl_khr_local_int32_extended_atomics, cl_khr_local_int32_base_atomics, cl_khr_global_int32_base_atomics, cl_khr_gl_sharing, cl_intel_printf, cl_khr_global_int32_extended_atomics, cl_ext_device_fission]
CL_DEVICE_NAMEIntel(R) Core(TM) i7 CPU 940 @ 2.93GHzGeForce GTX 295GeForce GTX 295Intel(R) Core(TM) i7 CPU 940 @ 2.93GHz
CL_DEVICE_TYPECPUGPUGPUCPU
CL_DEVICE_AVAILABLEtruetruetruetrue
CL_DEVICE_VERSIONOpenCL 1.1 ATI-Stream-v2.2 (302)OpenCL 1.0 CUDAOpenCL 1.0 CUDAOpenCL 1.1
CL_DEVICE_PROFILEFULL_PROFILEFULL_PROFILEFULL_PROFILEFULL_PROFILE
CL_DEVICE_ENDIAN_LITTLEtruetruetruetrue
CL_DEVICE_VENDORGenuineIntelNVIDIA CorporationNVIDIA CorporationIntel(R) Corporation
CL_DEVICE_EXTENSIONS[cl_amd_device_attribute_query, cl_khr_byte_addressable_store, cl_khr_int64_extended_atomics, cl_khr_local_int32_extended_atomics, cl_amd_fp64, cl_amd_printf, cl_khr_local_int32_base_atomics, cl_khr_int64_base_atomics, cl_khr_global_int32_base_atomics, cl_khr_gl_sharing, cl_khr_global_int32_extended_atomics, cl_ext_device_fission][cl_khr_icd, cl_khr_byte_addressable_store, cl_khr_fp64, cl_khr_local_int32_extended_atomics, cl_khr_local_int32_base_atomics, cl_nv_compiler_options, cl_nv_pragma_unroll, cl_nv_device_attribute_query, cl_khr_global_int32_base_atomics, cl_khr_gl_sharing, cl_khr_global_int32_extended_atomics][cl_khr_icd, cl_khr_byte_addressable_store, cl_khr_fp64, cl_khr_local_int32_extended_atomics, cl_khr_local_int32_base_atomics, cl_nv_compiler_options, cl_nv_pragma_unroll, cl_nv_device_attribute_query, cl_khr_global_int32_base_atomics, cl_khr_gl_sharing, cl_khr_global_int32_extended_atomics][cl_khr_byte_addressable_store, cl_khr_fp64, cl_khr_local_int32_extended_atomics, cl_khr_local_int32_base_atomics, cl_khr_global_int32_base_atomics, cl_khr_gl_sharing, cl_intel_printf, cl_khr_global_int32_extended_atomics, cl_ext_device_fission]
CL_DEVICE_MAX_COMPUTE_UNITS830308
CL_DEVICE_MAX_CLOCK_FREQUENCY2934124212422930
CL_DEVICE_VENDOR_ID40984318431832902
CL_DEVICE_OPENCL_C_VERSIONOpenCL C 1.1 com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info string [error: CL_INVALID_VALUE]com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info string [error: CL_INVALID_VALUE]OpenCL C 1.1
CL_DRIVER_VERSION2.0270.41.06270.41.061.1
CL_DEVICE_ADDRESS_BITS64323264
CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT8118
CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR161116
CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT4114
CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG2112
CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT4114
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE0112
CL_DEVICE_NATIVE_VECTOR_WIDTH_CHAR16com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]16
CL_DEVICE_NATIVE_VECTOR_WIDTH_SHORT8com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]8
CL_DEVICE_NATIVE_VECTOR_WIDTH_INT4com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]4
CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG2com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]2
CL_DEVICE_NATIVE_VECTOR_WIDTH_HALF0com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]0
CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT4com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]4
CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE0com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]2
CL_DEVICE_MAX_WORK_GROUP_SIZE10245125121024
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS3333
CL_DEVICE_MAX_WORK_ITEM_SIZES[1024, 1024, 1024][512, 512, 64][512, 512, 64][1024, 1024, 1024]
CL_DEVICE_MAX_PARAMETER_SIZE4096435243521024
CL_DEVICE_MAX_MEM_ALLOC_SIZE10737418242348318722347008003154703360
CL_DEVICE_GLOBAL_MEM_SIZE322122547293932748893880320012618813440
CL_DEVICE_LOCAL_MEM_SIZE32768163841638432768
CL_DEVICE_HOST_UNIFIED_MEMORYtruecom.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]com.jogamp.opencl.CLException$CLInvalidValueException: error while asking for info value [error: CL_INVALID_VALUE]true
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE655366553665536131072
CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE640064
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE3276800262144
CL_DEVICE_MAX_CONSTANT_ARGS899128
CL_DEVICE_IMAGE_SUPPORTfalsetruetruetrue
CL_DEVICE_MAX_READ_IMAGE_ARGS0128128128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS088128
CL_DEVICE_IMAGE2D_MAX_WIDTH0409640968192
CL_DEVICE_IMAGE2D_MAX_HEIGHT032768327688192
CL_DEVICE_IMAGE3D_MAX_WIDTH0204820482048
CL_DEVICE_IMAGE3D_MAX_HEIGHT0204820482048
CL_DEVICE_IMAGE3D_MAX_DEPTH0204820482048
CL_DEVICE_MAX_SAMPLERS01616128
CL_DEVICE_PROFILING_TIMER_RESOLUTION110001000340831
CL_DEVICE_EXECUTION_CAPABILITIES[EXEC_KERNEL, EXEC_NATIVE_KERNEL][EXEC_KERNEL][EXEC_KERNEL][EXEC_KERNEL, EXEC_NATIVE_KERNEL]
CL_DEVICE_HALF_FP_CONFIG[][][][]
CL_DEVICE_SINGLE_FP_CONFIG[DENORM, INF_NAN, ROUND_TO_NEAREST, ROUND_TO_INF, ROUND_TO_ZERO][INF_NAN, ROUND_TO_NEAREST, ROUND_TO_INF, ROUND_TO_ZERO, FMA][INF_NAN, ROUND_TO_NEAREST, ROUND_TO_INF, ROUND_TO_ZERO, FMA][DENORM, INF_NAN, ROUND_TO_NEAREST]
CL_DEVICE_DOUBLE_FP_CONFIG[][DENORM, INF_NAN, ROUND_TO_NEAREST, ROUND_TO_INF, ROUND_TO_ZERO, FMA][DENORM, INF_NAN, ROUND_TO_NEAREST, ROUND_TO_INF, ROUND_TO_ZERO, FMA][DENORM, INF_NAN, ROUND_TO_NEAREST, ROUND_TO_INF, ROUND_TO_ZERO, FMA]
CL_DEVICE_LOCAL_MEM_TYPEGLOBALLOCALLOCALGLOBAL
CL_DEVICE_GLOBAL_MEM_CACHE_TYPEREAD_WRITENONENONEREAD_WRITE
CL_DEVICE_QUEUE_PROPERTIES[PROFILING_MODE][OUT_OF_ORDER_MODE, PROFILING_MODE][OUT_OF_ORDER_MODE, PROFILING_MODE][OUT_OF_ORDER_MODE, PROFILING_MODE]
CL_DEVICE_COMPILER_AVAILABLEtruetruetruetrue
CL_DEVICE_ERROR_CORRECTION_SUPPORTfalsefalsefalsefalse
cl_khr_fp16falsefalsefalsefalse
cl_khr_fp64falsetruetruetrue
cl_khr_gl_sharing | cl_APPLE_gl_sharingtruetruetruetrue

The CLInfo utility is part of the jocl-demos project and is also available via webstart. For a plain text version of the above output you can run:

 java -jar jocl.jar:gluegen-rt.jar\
    -Djava.library.path="path/to/jocl/libs:path/to/gluegen/libs" com.jogamp.opencl.util.CLInfo

(btw to install the intel sdk on debian based systems follow this link)

happy coding!


// Java Binding for the OpenCL API

I am currently working on Java Binding for the OpenCL API using GlueGen (as used in JOGL, JOAL). The project started as part of my bachelor of CS thesis short after the release of the first OpenCL specification draft and is now fully feature complete with OpenCL 1.1. JOCL is currently in the stabilization phase, a beta release shouldn't be far away.

Overview - How does it work?

JOCL enables applications running on the JVM to use OpenCL for massively parallel, high performance computing tasks, executed on heterogeneous hardware (GPUs, CPUs, FPGAs etc) in a platform independent manner. JOCL consists of two parts, the low level and the high level binding.

The low level bindings (LLB) are automatically generated using the official OpenCL headers as input and provide a high performance, JNI based, 1:1 mapping to the C functions.

This has the following advantages:

  • reduces maintenance overhead and ensures spec conformance
  • compiletime JNI bindings are the fastest way to access native libs from the JVM
  • makes translating OpenCL C code into Java + JOCL very easy (e.g. from books or tutorials)
  • flexibility and stability: OpenCL libs are loaded dynamically and accessed via function pointers

The hand written high level bindings (HLB) is build on top of LLB and hides most boilerplate code (like object IDs, pointers and resource management) behind easy to use java objects. HLB use direct NIO buffers internally for fast memory transfers between the JVM and the OpenCL implementation and is very GC friendly. Most of the API is designed for method chaining but of course you don't have to use it this way if you don't want to. JOCL also seamlessly integrates with JOGL 2 (both are built and tested together). Just pass the JOGL context as parameter to the JOCL context factory and you will receive a shared context. If you already know OpenCL and Java, HLB should be very intuitive for you.

The project is available on jogamp.org. Please use the mailinglist / forum for feedback or questions and the bugtracker if you experience any issues. The JOCL root repository is located on github, you may also want to take a look at the jocl-demos project. (If the demos are not enough you might also want to take a look at the junit tests)

Screenshots (sourcecode in jocl-demos project):

JOCL Julia Set high precision

More regarding OpenGL interoperability and other features in upcoming blog entries.

The following sample shows basic setup, computation and cleanup using the high level APIs.

Hello World or parallel a+b=c


/**
 * Hello Java OpenCL example. Adds all elements of buffer A to buffer B
 * and stores the result in buffer C.
 * Sample was inspired by the Nvidia VectorAdd example written in C/C++
 * which is bundled in the Nvidia OpenCL SDK.
 * @author Michael Bien
 */
public class HelloJOCL {

    public static void main(String[] args) throws IOException {
        // Length of arrays to process (arbitrary number)
        int elementCount = 11444777;
        // Local work size dimensions
        int localWorkSize = 256;
        // rounded up to the nearest multiple of the localWorkSize
        int globalWorkSize = roundUp(localWorkSize, elementCount);

        // setup
        CLContext context = CLContext.create();

        CLProgram program = context.createProgram(
                       HelloJOCL.class.getResourceAsStream("VectorAdd.cl")
                                 ).build();

        CLBuffer<FloatBuffer> clBufferA =
                       context.createFloatBuffer(globalWorkSize, READ_ONLY);
        CLBuffer<FloatBuffer> clBufferB =
                       context.createFloatBuffer(globalWorkSize, READ_ONLY);
        CLBuffer<FloatBuffer> clBufferC =
                       context.createFloatBuffer(globalWorkSize, WRITE_ONLY);

        out.println("used device memory: "
            + (clBufferA.getSize()+clBufferB.getSize()+clBufferC.getSize())/1000000 +"MB");

        // fill read buffers with random numbers (just to have test data).
        fillBuffer(clBufferA.getBuffer(), 12345);
        fillBuffer(clBufferB.getBuffer(), 67890);

        // get a reference to the kernel functon with the name 'VectorAdd'
        // and map the buffers to its input parameters.
        CLKernel kernel = program.createCLKernel("VectorAdd");
        kernel.putArgs(clBufferA, clBufferB, clBufferC).putArg(elementCount);

        // create command queue on fastest device.
        CLCommandQueue queue = context.getMaxFlopsDevice().createCommandQueue();

        // asynchronous write to GPU device,
        // blocking read later to get the computed results back.
        long time = nanoTime();
        queue.putWriteBuffer(clBufferA, false)
             .putWriteBuffer(clBufferB, false)
             .put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize)
             .putReadBuffer(clBufferC, true);
        time = nanoTime() - time;

        // cleanup all resources associated with this context.
        context.release();

        // print first few elements of the resulting buffer to the console.
        out.println("a+b=c results snapshot: ");
        for(int i = 0; i < 10; i++)
            out.print(clBufferC.getBuffer().get() + ", ");
        out.println("...; " + clBufferC.getBuffer().remaining() + " more");

        out.println("computation took: "+(time/1000000)+"ms");

    }

    private static final void fillBuffer(FloatBuffer buffer, int seed) {
        Random rnd = new Random(seed);
        while(buffer.remaining() != 0)
            buffer.put(rnd.nextFloat()*100);
        buffer.rewind();
    }

    private static final int roundUp(int groupSize, int globalSize) {
        int r = globalSize % groupSize;
        if (r == 0) {
            return globalSize;
        } else {
            return globalSize + groupSize - r;
        }
    }

}

VectorAdd.cl


    // OpenCL Kernel Function for element by element vector addition
    kernel void VectorAdd(global const float* a,
                          global const float* b,
                          global float* c, int numElements) {

        // get index into global data array
        int iGID = get_global_id(0);

        // bound check (equivalent to the limit on a 'for' loop)
        if (iGID >= numElements)  {
            return;
        }

        // add the vector elements
        c[iGID] = a[iGID] + b[iGID];
    }

// JogAmp at SIGGRAPH 2010

The JogAmp team will be present at SIGGRAPH this year:
3D & Multimedia Across Platforms and Devices Using JOGL
Tuesday, 27 July | 4:00 PM - 6:00 PM

This session discusses the features, contributions, and future of OpenGL, OpenCL, and OpenMax
across devices and OS exposed on top of Java using the JogAmp open-source libraries.
link to Session

hope to meet you there.

about JogAmp.
JogAmp is the home of high performance Java libraries for 3D Graphics, Multimedia and Processing. JogAmp consists currently of the projects JOGL, JOCL and JOAL which provide cross platform language bindings to the OpenGL, OpenCL, OpenAL and OpenMAX APIs.


- - - -
(yes i know i should start bogging again :))