OpenCL under Wine

The Wine 1.3 branch has support for OpenCL 1.0 since 1.3.9. Since Microsoft likes to get a little part of the Linux-dominated HPC-market, support for GPGPU is pretty good under the $799.00 costing Visual Studio – the free Express-version is not supported well. But why not take the produced software back via Wine? Problem is that OpenCL is not in the current Wine binaries for some reason, but that is fixable until we wait for inclusion…

Lazy or not much time? You can try my binaries (Ubuntu 32, NVIDIA), but I cannot guarantee they work for you and it is on your own responsibility: download (reported not working by some). See second part of step 3, what to do with it.

All the steps

I assume you have the OpenCL-SDK installed, but let me know if I need to add more details or clear up some steps.

1 – get the sources

The sources are available here. Be sure you download at least version 1.3.9. Alternatively you download the latest from git. You can get it by going to a directory and execute:

git clone git://source.winehq.org/git/wine.git

A directory “wine” will be created. That was easy, so lets go to bake some binaries.

Continue reading “OpenCL under Wine”

OpenCL mini buying guide for X86

Developing with OpenCL is fun, if you like debugging. Having software with support for OpenCL is even more fun, because no debugging is needed. But what would be a good machine? Below is an overview of what kind of hardware you have to think about; it is not in-depth, but gives you enough information to make a decision in your local or online computer store.

Companies who want to build a cluster, contact us for information. Professional clusters need different hardware than described here.

Continue reading “OpenCL mini buying guide for X86”

Phoronix OpenCL Benchmark 3.0 beta

So you want OpenCL-benchmarks? Phoronix is a benchmark for OSX and Linux, created by Michael Larabel, Matthew Tippett (http://en.wikipedia.org/wiki/Phoronix_Test_Suite). On Ubuntu Phoronix version 2.8 is in the Ubuntu “app store” (Synaptic), but 3.0 has those nice OpenCL-tests. The tests are based on David Bucciarelli‘s OpenCL demos. Starting to use Phonornix 3.0 (beta 1) is done in 4 easy steps:

  1. Download the latest beta-version from http://www.phoronix-test-suite.com/?k=downloads
  2. Extract. Can be anywehre. I chose /opt/phoronix-test-suite
  3. Install. Just type ./phoronix-test-suite in a terminal
  4. Use.

WARNING: It is beta-software and the following might not work on your machine! If you have problems with this tutorial and want or found a fix, post a reply.

Continue reading “Phoronix OpenCL Benchmark 3.0 beta”

Windows on ARM

In 2010 Microsoft got interested in ARM, because of low-power solutions for server-parks. ARM tried to lobby for years to convince Microsoft to port Windows to their architecture and now the result is there. Let’s not look to the past, why they did not do it earlier and depended completely on Intel, AMD/ATI and NVIDIA. NB: This article is a personal opinion, to open up the conversation about Windows plus OpenCL.

While Google and Apple have taken their share on the ARM-OS market, Microsoft wants some too. A wise choice, but again late. We’ve seen how the Windows-PC market was targeted first from the cloud (run services in the browser on any platform) and Apple’s user-friendly eye-candy (A personal computer had to be distinguished from a dull working-machine), then from the smartphones and tablets (many users want e-mail and a browser, not sit behind their desk). MS’s responses were Azure (Cloud, Q1 2010), Windows 7 (OS with slick user-interface, Q3 2009), Windows Phone 7 (Smartphones, Q4 2010) and now Windows 8 (OS for X86 PCs and ARM tablets, 2012 or later).

Windows 8 for ARM will be made with assistance from NVIDIA, Qualcomm and Texas Instruments, according to their press-release [1]. They even demonstrated a beta of Windows 8 running Microsoft Office on ARM-hardware, so it is not just a promise.

How can Microsoft support this new platform and (for StreamHPC more important) what will the consequences be for OpenCL, CUDA and DirectCompute.

Continue reading “Windows on ARM”

NVIDIA’s answer to SandyBridge and Fusion

Intel has Sandy Bridge, AMD has Fusion, now NVIDIA has a combination of CPU and GPU too: Project Denver. The only difference is that it is not X86-based, but an ARM-architecture. And most-probable the most powerful ARM-GPU of 2011.

For years there were ARM-based Systems-on-a-chip: a CPU and a GPU combined (see list below). On the X86-platform the “integrated GPU” was on the motherboard, and since this year now both AMD/ATI and Intel hit this “new market”.The big advantage is that it’s cheaper to produce, is more powerful per Watt (in total) and has good acceleration-potential. NVIDIA does not have X86-chips and would have been the big loser of 2011; they did everything to reinvent themselves: 3D was reintroduced, CUDA was actively developed and pushed (free libraries and tools, university-programs, many books and trainings, Tesla, etc), a mobile Tegra graphics solution [1] (see image at the right), and all existing products got extra backing from the marketing-department. A great time for researchers who needed to get free products in exchange of naming NVIDIA in their research-reports.

NVIDIA chose for ARM; interesting for who is watching the CUDA-vs-OpenCL battle, since CUDA was for GPUs of NVIDIA on X86 and ARM was solely for OpenCL. Period. In the contrary to their other ARM-based chips, this new chip probably won’t be in smartphones (yet); it targets systems that need more GPU-power like CUDA and games.

In a few days the article about Windows-on-ARM is to be released, which completes this article.

Continue reading “NVIDIA’s answer to SandyBridge and Fusion”

Happy New Year!

About a year ago this site was launched and a half year ago StreamHPC as a company was official for the Chamber of Commerce. It has been a year of hard work, but the reason for this all started after seeing the cover of a book about bore-outs. The result is there with a growing number of visitors from all over the world (from 62 countries since 23-Dec-2010) and new twitter-followers every week. Now some mixed news for 2011:

  • We are soon going to release a few plugins for Eclipse, both free and paid, to simplify your development.
  • 2011 will be the year of hybrid processors (Intel SandyBridge and AMD Fusion), which will make OpenCL much more popular.
  • 2011 is also going to be the year of the smart-phone (prognosis: in 2011 more smart-phones will be sold than PCs). So even more OpenCL-potential.
  • At 31-Dec-2010 we migrated the site to a faster server to reduce waiting-time also online.
  • The book will be released in parts, to avoid more delays.
  • There will be around ten (short) articles published in January. Both developers and managers will be served.
  • Our goal is to expand. We have shown you our vision, but we want to show you more.

In a few words: 2011 is going to be exciting! We wish all our readers, business-partners, friends, family and (new) customers a super-accelerated 2011!

StreamHPC – we accelerate your computations

DirectCompute’s unpopularity

In the world of GPGPU we have currently 4 players: Khronos OpenCL, NVIDIA CUDA, Microsoft DirectCompute and PathScal ENZO. You probably know CUDA and OpenCL already (or start reading more articles from this blog). ENZO is a 64bit-compiler which serves a small niche-market, and DirectCompute is built on top of CUDA/OpenCL or at least uses the same drivers.

Edit 2011-01-03: I was contacted by Pathscale about my conclusions about ENZO. The reason why not much is out there is that they’re still in closed alpha. Expect more to hear from them about ENZO somewhere in the coming 3 months.

A while ago there was an article introducing OpenCL by David Kanter who claimed on page 4 that DirectCompute will win from CUDA. I quote:

Judging by history though, OpenCL and DirectCompute will eventually come to dominate the landscape, just as OpenGL and DirectX became the standards for graphics.

I twittered that I totally disagreed with him and in this article I will explain why I think that.

Continue reading “DirectCompute’s unpopularity”

OpenCL Fireworks

I like and appreciate differences in the many cultures on our Earth, but also like to recognise different very old traditions everywhere to feel a sort of ancient bond. As an European citizen I’m quite familiar with the replacement of the weekly flowers with a complete tree, each December – and the burning of al those trees in January. Also celebration of New Year falls on different dates, the Chinese new year being the best known (3 February 2011). We – internet-using humans – all know the power of nicely coloured gunpowder: fireworks!

Let’s try to explain the workings of OpenCL in terms of fireworks. The following data is not realistic, but gives a good idea on how it works.

Continue reading “OpenCL Fireworks”

http://www.flickr.com/photos/imabug/2946930401/

OpenCL Potentials: Medical Imaging

Photo by Eugene MahWhen you ever saw a CT or MRI scanner, you might have noticed the full-sized computer next to it (especially the older ones). There is quite some processing power needed to keep up with the data-stream coming from the scanner, to process the data to a 3D-image and to visualise the data on a 2D-screen. Luckily we have OpenCL to make it even faster; which doctor doesn’t want real-time high-resolution results and which patient doesn’t want to see the results on Apple iPad or Samsung Galaxy Tab?

Architects, bankers and doctors have one thing in common: they get a better feeling for the current subject if they can play with the data. OpenCL makes it possible to process data much faster and thus let the specialist play with it. The interesting part of IT is that it is in every domain now and therefore a new series: OpenCL-potentials.

Continue reading “OpenCL Potentials: Medical Imaging”

OpenCL on the CPU: AVX and SSE

When AMD came out with CPU-support I was the last one who was enthusiastic about it, comparing it as feeding chicken-food to oxen. Now CUDA has CPU-support too, so what was I missing?

This article is a quick overview on OpenCL on CPU-extensions, but expect more to come when the Hybrid X86-Processors actually hit the market. Besides ARM also IBM already has them; also more about their POWER-architecture in an upcoming article to give them the attention they deserve.

CPU extensions

SSE/MMX started in the 90’s extending the IBM-compatible X86-instruction, being able to do an add and a multiplication in one clock-tick. I still remember the discussion in my student-flat that the MP3s I could produce in only 4 minutes on my 166MHz PC just had to be of worse quality than the ones which were encoded in 15 minutes. No, the encoder I “found” on the internet made use of SSE-capabilities. Currently we have reached SSE5 (by AMD) and Intel introduced a new extension called AVX. That’s a lot of abbreviations! MMX stands for “MultiMedia Extension”, SSE for “Streaming SIMD Extensions” with SIMD being “Single Instruction Multiple Data” and AVX for “Advanced Vector Extension”. This sounds actually very interesting, since we saw SIMD and Vectors op the GPU too. Let’s go into SSE (1 to 4) and AVX – both fully supported on the new CPUs by AMD and Intel.

Continue reading “OpenCL on the CPU: AVX and SSE”

Thalesians talk – OpenCL in financial computations

End of October I had a talk for the Thalesians, a group that organises different kind of talks for people working or interested in the financial market. If you live in London, I would certainly recommend you visit one of their talks. But from a personal perspective I had a difficult task: how to make a very diverse public happy? The talks I gave in the past were for a more homogeneous and known public, and now I did not know at all what the level of OpenCL-programming was of the attendants. I chose to give an overview and reserve time for questions.

After starting with some honest remarks about my understanding of the British accent and that I will kill my business for being honest with them, I spoke about 5 subjects. Some of them you might have read here, but not all. You can download the sheets [PDF] via this link: Vincent.Hindriksen.20101027-Thalesians. The below text is to make the sheets more clear, but certainly is not the complete talk. So if you have the feeling I skipped a lot of text, your feeling is right.

Continue reading “Thalesians talk – OpenCL in financial computations”

Engineering GPGPU into existing software

At the Thalesian talk about OpenCL I gave in London it was quite hard to find a way to talk about OpenCL for a very diverse public (without falling back to listing code-samples for 50 minutes); some knew just everything about HPC and other only heard of CUDA and/or OpenCL. One of the subjects I chose to talk about was how to integrate OpenCL (or GPGPU in common) into existing software. The reason is that we all have built nice, cute little programs which were super-fast, but it’s another story when it must be integrated in some enterprise-level software.

Readiness

The most important step is making your software ready. Software engineering can be very hectic; managing this in a nice matter (i.e. PRINCE2) just doesn’t fit in a deadline-mined schedule. We all know it costs less time and money when looking at the total picture, but time is just against.

Let’s exaggerate. New ideas, new updates of algorithms, new tactics and methods arrive on the wrong moment, Murphy-wise. It has to be done yesterday, so testing is only allowed when the code will be in the production-code too. Programmers just have to understand the cost of delay, but luckily is coming to the rescue and says: “It is my responsibility”. And after a year of stress your software is the best in the company and gets labelled as “platform”; meaning that your software is chosen to include all small ideas and scripts your colleagues have come up “which are almost the same as your software does, only little different”. This will turn the platform into something unmanageable. That is a different kind of software-acceptance!

Continue reading “Engineering GPGPU into existing software”

New grown-ups on the block

Members of the band There is one big reason StreamHPC chose for OpenCL and that is (future) hardware-support. I talked about NVIDIA versus AMD a lot, but knowing others would join soon. AMD is correct when they say the future is fusion: hybrid computing with a single chip holding both CPU- and GPU-cores, sharing the same memory and interconnected at high speed. Merging the technologies would also give possible much higher bandwidths to memory for the CPU. Let us see in short which products from experienced companies will appear on the OpenCL-stage.

Continue reading “New grown-ups on the block”

OpenCL – the battle, part III

The first two parts described hardware-companies and operating systems, programming languages and software-companies, written about half a year ago. Now we focus on what has driven NVIDIA and ATI/AMD for decades: games.

Disclaimer: this is an opinion-piece on the current market. We are strong supporters of OpenCL and all companies which support it too. Since our advise on specific hardware in a consult will be based on specific demands on the customer, we could advise differently than would be expected on the below article.

Games

Computer games are cool; merely because you choose from so many different kinds. While Tetris will live forever, the latest games also have something to add: realistic physics simulation. And that’s what’s done by GPUs now. Nintendo has shown us that gameplay and good interaction are far more important than video-quality. The wow-factor for photo-realistic real-time rendering is not as it was years ago.
You might know the basics for falling objects: F = m*g (Force = Mass times Gravity-acceleration), and action = – reaction. If you drop some boxes, you can predict falling speed, interaction, rotation and possible change of centre of gravity from a still image as a human being. A computer has to do a lot more to detect collision, but the idea is very doable on a fast CPU. A very well-known open source library for these purposes is Bullet Physics. The nice thing comes, when there is more than just a few boxes, but thousands of them. Or when you walk through water or under a waterfall, see fire and smoke, break wood but bend metal, etc. The accelerometer of the iPod was a game-changer too in the demand for more realism in graphics. For an example of a “physics puzzle game” not using GPGPU see World of Goo (with free demo) – for the rest we talk more about high-end games. Of current game-ready systems PCs (Apple, Linux and Windows) have OpenCL support, Sony PlayStation 3 is now somewhat vague and the Xbox 360 has none.

The picture is from Crysis 3, which does not use OpenCL, as we know it.

Continue reading “OpenCL – the battle, part III”

OpenCL in the Clouds

Buzz-words are cool; they are loosely defined and are actually formed by the many implementation that use the label. Like Web 2.0 which is cool javascript for the one and interaction for the other. Now we have cloud-computing, which is cluster-computing with “something extra”. More than a year ago clouds were in the data-centre, but now we even have “private clouds”. So how to incorporate GPGPU? A cluster with native nodes to run our OpenCL-code with pre-distributed data is pretty hard to maintain, so what are the other solutions?

Distributed computing

Folding@home now has support for OpenCL to add the power of non-NVIDIA GPUs. While in clusters the server commands the clients what they have to do, here the clients ask the server for jobs. Disadvantage is that the clients are written for a specific job and are not really flexible to take different kind of jobs. There are several solutions for this code-distribution-problem, but still the solution is not suitable for smaller problems and small clusters.

Clusters: MPI

The project SHOC (Scalable HeterOgeneous Computing) is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing, and the software used to program them. While it is only a benchmark, it can be of great use when designing a cluster. For the rest I only found CUDA MPI-solutions, which are not ported to OpenCL yet.

Also check out Hoopoe, which is a cloud-computing service to run your OpenCL-kernels in their cloud. It seems to be more limited to .NET and have better support for CUDA, but it is a start. In Europe there is a start-up offering a rent-model for OpenCL-computation-time; please contact us if you want to get in contact with them.

Clusters: OpenMP

MOSIX has added a “Many GPU Package” to their cluster management system, so it now allows applications to transparently use cluster-wide OpenCL devices. When “choosing devices” not only the local GPU pops up, but also all GPUs in the cluster.
It works disk-less, in the way no files are copied to the computation-clients and all stays in-memory. Disk-less computations have an advantage when cloud-computer are not fully trusted. Take note that on most cloud-computers the devices need to be virtualised (see next part).

Below is its layered model, VCL being the “Virtual OpenCL Layer”.

They have chosen to base it on OpenMP; while the kernels don’t need to be altered, some OpenMP-code needs to be added. They are very happy to tell it takes much less code to use openMP instead of MPI.

You see a speed-up between 2.19 and 3.29 on 4 nodes is possible. We see comparable cluster-speed-ups in an old cluster-study. The actual speed-up on clusters depends mostly on the amount of data that needs to be transferred.

The project references to a project called remote CUDA, which only works with NVIDIA-GPUs.

Device Virtualisation

Currently there is no good device virtualisation for OpenCL. The gVirtuS-project currently only supports CUDA, but they claim it is easily rewritten to OpenCL. Code needs to be downloaded with a Mercurius-client (comparable to GIT and in repositories of most Linux-distributions):
> hg clone http://osl.uniparthenope.it/hg/projects/gvirtus/gvirtus gvirtus
Or download it here (dated 7-Oct-2010).

Let me know when you ported it to OpenCL! Actually gVirtuS does not do the whole trick since you need to divide the host-devices between the different guest-OSes, but luckily there is an extension which provides sharing of devices, called fission. More about this later.

We can all agree there still needs to be done a lot in this area of virtualised devices to get OpenCL in the cloud. If you can’t wait, you can theoretically use MOSIX locally.

Afterword

A cloud is the best buzz-word to market a scalable solution to overcome limitations of internet connected personal devices. I personally think the biggest growth will be in personal clouds, so companies will have their own in-house cloud-server (read: clusters); people just want to have a feeling of control, comparable with preference of a daily traffic jam above public transport. But nevertheless shared clouds have potential if it comes to computation-intensive jobs which do not need to be done all year round.

The projects presented here are a start to have OpenCL-power at a larger scale for more demanding cases. Since we can have more power at our fingertips with one desktop-pc stuffed with high-end video-cards than a 4-year-old supercomputer-cluster, there is still time

Please send your comment if I missed a project or method.

Learning both OpenCL and CUDA

Be sure to read Taking on OpenCL where I’ve put my latest insights – also for CUDA.

The two¹ “camps” OpenCL and CUDA both claim you should first learn their language first, after which the other would be easy to learn. I’m from the OpenCL-camp, so I say you should learn OpenCL first, but with a strong emphasis on hardware-architecture understanding. If I had chosen for CUDA I would have said the opposite, so in other words it does not matter which you do first. But psychology tells us that you probably like the first language more since there is where you discovered the magic; also most people do not like to learn a second language which is much alike and does not add a real difference. Most programmers just want to get the job done and both camps know that. Be aware of that.

NVIDIA is very good in marketing their products, AMD has – to say it modest – a lower budget for GPGPU-marketing. As a programmer you should be aware of this difference.

The possibilities of OpenCL are larger than those of CUDA, because of task-parallel programming and support for far more different architectures. At the other side CUDA is much more user-friendly and has a lot of convenience built-in.

Continue reading “Learning both OpenCL and CUDA”

Waiting for Mobile OpenCL

People who follow me, know my interest in ARM Cortex CPU & Mali GPU and Imagination Technology’s PowerVR, regarding OpenCL-potential. Here is an overview of what I found out until now and has more open ends than answers. I was very hopeful about access to mobile OpenCL for developers, but now I think it will have to wait until 2011. It seems that the mobile phone makers keep the power to themselves in the form of system-libraries instead of giving direct access. Actually a wise choice, given the fact that a bad kernel could easily crash the phone.

Imagination Technology’s PowerVR SGX product provides the GPU-power for the iPhone 4 and various new Android phones. The company does not provide an OpenCL SDK directly to end-users, but leaves the responsibility to the implementers of their IP. Strangely enough they offer SDKs for OpenGL, and say “no comment” to a pretty direct forum-question. In other words: we don’t know what to expect.

Samsung has the only official implementation for ARM (ARM Cortex-A9 MPCore CPU) for OpenCL. Samsung just released their home-made Linux-based mobile phone OS “Bada”, but there is not a sign of OpenCL in their API. Samsung being the only one who could open an API to developers on their phones officially, does not make me smile yet.

There are many, many voices that Apple iOS 4 has support for OpenCL, but then only in the system-libraries. Given the fact that Apple is a big fan of OpenCL, we can assume this is true. See the web for a lot of articles about this.

ZiiLabs is open about their support for OpenCL in their ZMS-05 processor, and has an early access program. Early access still means waiting.

Qualcomm’s Snapdragon does not mention OpenCL at all, while it powers the more powerful smartphones. It mentions a recent job-post, so you know the drill: wait.

Android has good support for OpenGL ES, but not officially for OpenCL.You might have heard of the particle experiment for Android, which is actually OpenGL-based. It also mentions the Android-version of the bullet engine (broken link) for physics-computations, but also no OpenCL. It doesn’t look like Android “Gingerbread” will have support.

So what did you learn after reading this? That developers still won’t have access to OpenCL-API on a mobile platform, and that you have to wait until next year or enter ZiiLabs’ early access program. More about the good and bad of hiding OpenCL on this blog.

Khronos OpenCL presentation at SIGGRAPH 2010

Here you find the videos uploaded by Khronos of their presentation about OpenCL. I added the time-line, so you can scroll to the more interesting parts easily. The presentation by Ofer Rosenberg of Intel and Cliff Woolly of NVIDIA were not uploaded (yet). Please note that for non-American people the speech of Affi Munchie is hard to hear; luckily his sheets explain most.

For the first two presentations the sheets can be downloaded from the Khronos-website. The time-line has the sheet-numbers mentioned.

0:00 [sheet 1] Presentation by the president of Khronos and chair of the session: Neill Trevett of NVIDIA.
0:06 [sheet 2] Welcome and a quick overview
1:12 [sheet 3] The prizes for the attendees (not us, online viewers)
1:40 [4] Overview of all members of Khronos. Khronos does not only take care of OpenCL but also the more famous OpenGL and projects like Collada.
2:26 [5] Processor Parallelism. CPUs are getting more parallel and GPUs more programmable. The overlapping area is called Heteregenous Computing and there is where OpenCL pops up.
3:10 [6] OpenCL timeline from version 1.0 to 1.1.
4:44 [7] OpenCL workinggroup with only 30 logos. He mentions missing logos like the one from Apple.
5:18 [8] The Visual Computing Ecosystem, where OpenCL interoperability with other standards are shown. The talk is not complete, so I don;t know if he talks about DirectX.

Continue reading “Khronos OpenCL presentation at SIGGRAPH 2010”

To think about: an OpenCL Wizard

A part of reaction on my earlier post was “VB Programmers span the whole range from hobbyists to scientists and professions. There is no reason that VB programmers should be left out of the loop in GPU programming. (…) CUDA and OpenCL are fundamentally lower level languages that do not have the same level of user-friendlyness that VB gives, and so they require a bit more care to program. (…)“. By selecting parts, I probably put the emphasis very wrong, but it brought me an idea: “how a Visual Basic Wizzard for OpenCL” would look like.

It is actually very interesting to think how to get GPGPU-power to a programmer who’s used to “and then and then and then”; how do you get parallel processing into the mind of an iterative thinker? Simplification is not easy!

By thinking about an OpenCL-wizard, you start to understand the bottlenecks of, and need for OpenCL initialisation.

Actually it could better be built into an existing framework like OpenMP (or something alike in Python, Java, etc) or the IDE could give hints that the standard-function could be replaced by a GPGPU-version. But we just want a wizard which generates “My First OpenCL Program”, which puts a smile on the face of programmers who use their mouse a lot.

Continue reading “To think about: an OpenCL Wizard”

The rise of the GPGPU-compilers

Painting "High Rise" made by Huma Mulji
Painting “High Rise” made by Huma Mulji

If you read The Disasters of Visual Designer Tools you’ll find a common thought about easy programming: many programmers don’t learn the hard-to-grasp backgrounds any more, but twiddle around and click a program together. In the early days of BASIC, you could add Assembly-code to speed up calculations; you only needed tot understand registers, cache and other details of the CPU. The people who did that and learnt about hardware, can actually be considered better programmers than the C++ programmer who completely relies on the compiler’s intelligence. So never be happy if the control is taken out of your hands, because it only speeds up the easy parts. An important side-note is that recompiling easy readable code with a future compiler might give faster code than your optimised well-engineered code; it’s a careful trade-off.

Okay, let’s be honest: OpenCL is not easy fun. It is more a kind of readable Assembly than click-and-play programming. But, oh boy, you learn a lot from it! You learn architectures, capabilities of GPUs, special purpose processors and much more. As blogged before, OpenCL probably is going to be the hidden power for non-CPUs wrapped in something like OpenMP.

Continue reading “The rise of the GPGPU-compilers”

OpenCL 1.1 changes compared to 1.0

This blog-entry is of interest for you, if you don’t want to read the whole new specifications [PDF] for OpenCL 1.1, but just want an overview of the most important changes differences with 1.0.

The news-release sums up the changes for 1.1 like this:

  1. New datatypes including 3-component vectors and additional formats
  2. Handling command from multiple hosts and processing buffers across multiple devices
  3. Operations on regions of a buffer including read, write and copy of 1D, 2D and 3D rectangular regions
  4. Enhanced use of events to drive and control command execution
  5. Additional OpenCL C built-in functions such as integer clamp, shuffle and asynchronous strided copies
  6. Improved OpenGL interoperability through efficient sharing of images and buffers by linking OpenCL and OpenGL events.

Furthermore we can read the update is completely backwards-compatible with version 1.0. The obvious macro‘s CL_VERSION_1_0 and CL_VERSION_1_1 are added to handle the versioning, but what’s more? This blog-post discusses most changes and with some subjective opinions added to it.

Additional Formats

3-component vectors

We only had 2-, 4-, 8 or 16-component vectors, but not 3 which actually was somewhat strange. The functions vload3, vload_half3, vloada_half3 and vstore3, vstore_half3, vstorea_half3 have been added to the family. Watch out, that for the half-functions the offset is calculated somewhat different compared to the even-sized vectors. In version 1.0 you could have chosen for a 4-component vector when using a lot of calculations, or a struct. If you see the new function vec_step below, it seems that it is not more memory-efficient to use this vector instead of a 4-component vector.

RGB with Padding

We have support CL_RGB, CL_RGBa (= RGB with an alpha-channel) and now also RGBx (with padding-channel). The same variants are there for CL_R and CL_RG. Good for graphics-programmers, or for easier reading of 32 bpp BMPs.

Cloud Computing / Multi-user Environments

The support for different hosts gives possibilities for cloud-computing. Side-note: cloud-computing is another word for multi-user environments, with some promotion for big data-centres. All API-functions except clSetKernelArg are thread-safe now; but only when kernels are not shared between hosts; see appendix A.2 for more information. The important part is that you think clearly about how to design your software if you now can assume others can take your resources now too. OpenCL already needed a lot of planning when claiming and releasing resources, so you’re probably already mastering it; now just check more often how much resources are available.

Region-specific Operations

Regions make it possible to split a big buffers to form a queue without having to keeping track of dimensions and offsets during operations, run from host. See clCreateSubBuffer for more information. A real convenience, but watch out when writing to overlapping buffers. The functions clEnqueueCopyBufferRec, clEnqueueReadBufferRect en clEnqueueWriteBufferRect helps synchronising commands to copy, read to or write from a region.

Enhanced Control-room

My favourite description of the host is “the control-room”, since you are not over there on the device but in Houston. The more control and information, the better. The new events are clSetMemObjectDestructorCallback, clCreateUserEvent, clSetUserEventStatus and clSetEventCallback. The first event-listener lets you know when resources a freed, so you can keep track. User-events can be put in the event_wait_list in various functions just like the built-in events; the function will start when all events are CL_COMPLETE. With clSetEventCallback immediate actions-on-events can be programmed; combined with the user-events the programmer got some powerful tools. See the example at clSetUserEventStatus for how to use the user-events.

OpenGL events and Direct3D support

The function clCreateEventFromGLsyncKHR links a CL-event to a GL-event by just giving the name of the OpenGL-event. See gl_sharing for more info.

OpenCL has now support for Direct3D 10, which is great! This might also be a good step to make DirectCompute lighter. See cl_khr_d3d10_sharing for more info. Welcome DirectX-developers! One favour: please be aware that DirectX works on Windows only, not on Apple OSX or iOS, (Embedded) Linux or Symbian. If you use clean calls, it will be more easy to port to other platforms.

Other New Kernel-functions

The following new functions were added to the kernel:

  • get_global_offset: returns the offset of the enqueued kernels.
  • minmag and maxmag: returns the argument with the minimum or maximum distance to zero, falls back to fmin and fmax if distance is equal or an argument is NaN. Example: maxmag(-5, 3) = -5, minmag(-3, 3) = -3.
  • clamp: returns boundary-values if the given number is not between the boundaries.
  • vec_step: returns the number of elements in a scalar or a vector. A scalar returns 1, a vector 2, 4, 8 or 16. If the size is 3, the function returns 4.
  • shuffle and shuffle2: shuffles one or two vectors given another vector with the indices of the new order. Indeed plain old permutations.
  • async_workgroup_strided_copy: buffers between global and local memory on the device. When used correctly, this can overcome some of the hassle when you need to work on global memory objects, but need more speed. Correct usage is described in the reference.

The functions min and max now also work component-wise with a vector as first argument and a scalar as second. Min({2, 4, 6, 8}, 5) will give {2, 4, 5, 5}.

Conclusion

While the many revisions of OpenCL 1.0 were really minor and not a lot attention was paid to them, 1.1 is a big step forward. If you see what has been done to multi-user environments, NVidia and AMD have a lot of work to do with their drivers.

You can read in revision 33 there has been some heated discussion and there was pressure on the decision:

>>Should this extension be KHR or EXT?
PROPOSED: KHR. If this extension is to be approved by Khronos then it should be KHR, otherwise EXT. Not all platforms can support this extension, but that is also true of OpenGL interop.
RESOLVED: KHR.<<

The part “not all platforms” is very politically written down, since exactly one platform supports this specific extension. I have seen too many of these pressured discussions and I hope Khronos is stronger than i.e. ISO and OpenCL will remain as open as OpenGL.

I’m very happy with the new version, since there is more control with loads of extra events, now multiple hosts are possible, and the forgotten 3-component vector was added. Now let me know in the comments what you think of the new version.

By the way, not all is new. Deprecated are clSetCommandQueueProperty and the __ROUNDING_MODE__ macro.