// Many little improvements made it into JOCL recently

OK, some of them are big, but I will only cover the little things in this blog entry :).

CLKernel

I added several utility methods to CLKernel and related classes. For example, it is now possible to create a kernel and set its arguments in one line.


CLKernel sha512 = program.createCLKernel("sha512", padBuffer, digestBuffer, rangeBuffer);

Thanks to feedback in the JOCL forums, I also added methods to set vector-typed arguments directly. In the past you could only do this by setting them via a java.nio.Buffer.


kernel.setArg(index, x, y, z, w);

Another small feature of CLKernel is the option to enforce 32-bit arguments. If you want to switch between single and double floating-point precision at runtime, or mix both to improve performance, you have to compile the program with the double FP extension enabled. With kernel.setForce32bitArgs(true), all Java doubles used as kernel arguments are automatically cast down to 32-bit CL floats (see the MultiDeviceFractal demo for an example). This is nothing special, but it might save you several if(single){setArg((float)foo)}else{setArg(foo)} constructs.
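
The effect of the flag can be pictured with a small stand-alone sketch. The ArgPacker class below is hypothetical and not part of JOCL. It only shows how a single switch can decide whether a Java double is packed as a 4-byte float or an 8-byte double, which is roughly the bookkeeping the flag takes off your hands.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, NOT JOCL code: models how one flag can replace
// repeated if(single){...}else{...} blocks when packing kernel arguments.
class ArgPacker {

    private boolean force32bit;

    ArgPacker setForce32bitArgs(boolean force) {
        this.force32bit = force;
        return this;
    }

    // Packs a double either as a 32-bit float or as a 64-bit double.
    ByteBuffer pack(double value) {
        int bytes = force32bit ? Float.BYTES : Double.BYTES;
        ByteBuffer buffer = ByteBuffer.allocate(bytes).order(ByteOrder.nativeOrder());
        if (force32bit) {
            buffer.putFloat((float) value); // narrowing cast, precision is lost here
        } else {
            buffer.putDouble(value);
        }
        buffer.flip();
        return buffer;
    }

    public static void main(String[] args) {
        ArgPacker packer = new ArgPacker();
        System.out.println(packer.pack(0.1).remaining() + " bytes");
        System.out.println(packer.setForce32bitArgs(true).pack(0.1).remaining() + " bytes");
    }
}
```

The calling code stays the same in both modes; only the flag changes how wide the packed argument is.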

CLWork

CLKernel still only represents the function in the OpenCL program you want to call - nothing more. The new CLWork object contains everything required for kernel execution, like the NDRange and the kernel itself.


    int size = buffer.getNIOCapacity();
    CLWork1D work = CLWork.create1D(program.createCLKernel("sum", buffer, size));
    work.setWorkSize(size, 1).optimizeFor(device);

    // execute
    queue.putWriteBuffer(buffer, false)
         .putWork(work)
         .putReadBuffer(buffer, true);

optimizeFor(device) adjusts the workgroup size to meet device-specific recommended values. This should make sure that all compute units of your GPU are used by dividing the work into groups (however, this only works if your task does not care about the workgroup size; see the javadoc).
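
To make the relation between the two sizes concrete: in OpenCL 1.x the global work size has to be a multiple of the workgroup size. The sketch below is plain Java and does not claim to show what optimizeFor(device) does internally. roundUp is a hypothetical helper and the group size of 64 is just an assumed example value.

```java
// Illustrative only; NOT JOCL's implementation of optimizeFor(device).
class WorkSizes {

    // Rounds globalSize up to the next multiple of groupSize.
    static long roundUp(long globalSize, long groupSize) {
        if (groupSize <= 0) {
            throw new IllegalArgumentException("groupSize must be positive");
        }
        long remainder = globalSize % groupSize;
        return remainder == 0 ? globalSize : globalSize + groupSize - remainder;
    }

    public static void main(String[] args) {
        long elements  = 1000; // number of items to process
        long groupSize = 64;   // assumed workgroup size, device dependent
        long global = roundUp(elements, groupSize);
        System.out.println(global + " work items in " + (global / groupSize) + " groups");
    }
}
```

If the global size is padded like this, the kernel has to ignore the surplus work items, typically by comparing get_global_id(0) against the real element count passed as an argument.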

CLSubDevice

Sometimes you don't want to put your CLDevice under 100% load. This might be the case, for example, if your device is the CPU your application is running on, or if you have to share the GPU with an OpenGL context for rendering. One easy way of controlling device load is to limit the number of compute units used for a task.


    CLPlatform platform = CLPlatform.getDefault(version(CL_1_1), type(CPU));

    CLDevice device = platform.getMaxFlopsDevice(type(CPU));
    CLSubDevice[] subs = device.createSubDevicesByCount(4, 4);
    // the array now contains two virtual devices with four CPU cores each

    CLContext context = CLContext.create(subs);
    CLCommandQueue queue = subs[0].createCommandQueue();
    ...

CLSubDevice extends CLDevice and can be used for context creation, queue creation and everywhere else you would use a CLDevice. Before creating subdevices you should check that device.isFissionSupported() returns true.

CLProgram builder

Ok, this utility is not that new but I haven't blogged about it yet. If program.build() isn't enough you should take a look at the program builder. CLBuildConfiguration stores everything which is needed for program compilation and is easily configurable via the builder pattern :).


        // reusable builder
        CLBuildConfiguration builder = CLProgramBuilder.createConfiguration()
                                     .withOption(ENABLE_MAD)
                                     .forDevices(context.getDevices())
                                     .withDefine("RADIUS", 5)
                                     .withDefine("ENABLE_FOOBAR");
        builder.build(programA);
        builder.build(programB);
        ...
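
For readers wondering what such a configuration boils down to: the OpenCL compiler receives its settings as one option string, in which macros are passed as -D NAME or -D NAME=value. The stand-alone sketch below is a hypothetical mini builder, not JOCL's CLBuildConfiguration. It mimics the chaining style and assumes that ENABLE_MAD corresponds to the standard -cl-mad-enable option.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mini builder, NOT JOCL's CLBuildConfiguration.
class BuildOptions {

    private final List<String> parts = new ArrayList<>();

    // Adds a raw compiler option such as "-cl-mad-enable".
    BuildOptions withOption(String option) {
        parts.add(option);
        return this;
    }

    // Adds a macro without a value: -D NAME
    BuildOptions withDefine(String name) {
        parts.add("-D " + name);
        return this;
    }

    // Adds a macro with a value: -D NAME=value
    BuildOptions withDefine(String name, Object value) {
        parts.add("-D " + name + "=" + value);
        return this;
    }

    // Joins everything collected so far into one option string.
    String toOptionString() {
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        String options = new BuildOptions()
                .withOption("-cl-mad-enable") // assumed equivalent of ENABLE_MAD
                .withDefine("RADIUS", 5)
                .withDefine("ENABLE_FOOBAR")
                .toOptionString();
        System.out.println(options);
    }
}
```

Because the builder only collects state, the same instance can be applied to any number of programs, which is the point of a reusable configuration.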

CLBuildConfiguration is fully reusable and can be upgraded to a CLProgramConfiguration by combining it with a CLProgram. Both can be serialized, which allows you to store the build configuration or the entire prebuilt program on disk or send it over the network (caching binaries on disk can save startup time, for example).


        // program configuration
        ois = new ObjectInputStream(new FileInputStream(file));
        CLProgramConfiguration programConfig = CLProgramBuilder.loadConfiguration(ois, context);
        assertNotNull(programConfig.getProgram());
        ois.close();
        program = programConfig.build(); // builds from source or loads binaries if possible
        assertTrue(program.isExecutable());

Note: loading binaries and associating them with the right driver/device is currently not trivial with OpenCL. Even if everything works as intended, it is still possible that the driver refuses the binaries for some reason (a driver update, etc.). That's why it is recommended to add the program source to the configuration before calling build(), to allow an automatic rebuild as a fallback.
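
The fallback idea can be sketched independently of OpenCL. In the example below, ProgramCache, Entry and the compile step are all hypothetical stand-ins built on plain Java serialization, not JOCL API. A cached binary is reused only if it was produced for the current driver version; otherwise the stored source is rebuilt.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-ins, NOT JOCL API: "use the cached binary, else rebuild from source".
class ProgramCache {

    static class Entry implements Serializable {
        private static final long serialVersionUID = 1L;
        final String driverVersion; // driver the binary was built with
        final String source;        // kept next to the binary to allow a rebuild
        final byte[] binary;

        Entry(String driverVersion, String source, byte[] binary) {
            this.driverVersion = driverVersion;
            this.source = source;
            this.binary = binary;
        }
    }

    // Placeholder for a real compiler call.
    static byte[] compile(String source) {
        return source.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] serialize(Entry entry) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(entry);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Entry deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Entry) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the cached binary if it matches the driver, otherwise rebuilds from source.
    static byte[] build(Entry entry, String currentDriver) {
        if (entry.binary != null && entry.driverVersion.equals(currentDriver)) {
            return entry.binary;
        }
        if (entry.source == null) {
            throw new IllegalStateException("binary rejected and no source available");
        }
        return compile(entry.source);
    }

    public static void main(String[] args) {
        Entry stored = new Entry("driver-1.0", "kernel void noop() {}", new byte[] {1, 2, 3});
        Entry loaded = deserialize(serialize(stored));
        System.out.println(build(loaded, "driver-1.0").length); // cached binary is reused
        System.out.println(build(loaded, "driver-1.1").length); // driver changed, rebuilt from source
    }
}
```

Without the source in the entry, the second call would have nothing to fall back on and would fail.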


        // another entry point for complex builds (prepare() returns CLProgramConfiguration)
        program.prepare().withOption(ENABLE_MAD).forDevice(context.getMaxFlopsDevice()).build();

(all snippets have been stolen from the junit tests)
I am sure I forgot something... but this should cover at least some of the incremental improvements. Expect a few more blog entries for the larger features soon.

- - - - - -
In other news: Nvidia released OpenCL 1.1 drivers. Some of us thought this would never happen. All major vendors (AMD, Intel, NV, IBM, ZiiLABS ...) now support OpenCL 1.1 (screenshot).

have fun!



