PDFs of Monday 16 April

By exception, another PDF-Monday.

OpenCL vs. OpenMP: A Programmability Debate. Sometimes OpenCL produces the faster code, sometimes OpenMP does. From the conclusion: “OpenMP is more productive, while OpenCL is portable for a larger class of devices. Performance-wise, we have found a large variety of ratios between the two solutions, depending on the application, dataset sizes, compilers, and architectures.”

Improving Performance of OpenCL on CPUs. Focusing on how to optimise OpenCL. From the abstract: “First, we present a static analysis and an accompanying optimization to exclude code regions from control-flow to data-flow conversion, which is the commonly used technique to leverage vector instruction sets. Second, we present a novel technique to implement barrier synchronization.”

Variants of Mersenne Twister Suitable for Graphic Processors. Source-code at http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MTGP/

Accelerating the FDTD method using SSE and GPUs. “The Finite-Difference Time-Domain (FDTD) method is a computational technique for modelling the behaviour of electromagnetic waves in 3D space”. This is a project plan, but it describes the theory pretty well.

We accelerated the OpenCL backend of pyPaSWAS sequence aligner

Last year we accelerated the OpenCL-code in PaSWAS, which is open source software to do DNA/RNA/protein sequence alignment and trimming. It has users world-wide in universities, research groups and industry.

Below you’ll find the benchmark results of our acceleration work. You can also test it yourself, as the code is public. In the readme file you can learn more about the idea behind the software. Lots of background information is described in these two papers:

We chose PaSWAS because we really like bio-informatics and computational chemistry – the science is interesting, the problems are complex and the potential GPU-speedup is real. Other examples of such software we worked on are GROMACS and TeraChem.

Get ready for conversions of large-scale CUDA software to AMD hardware

In the past years we have been translating several types of software to AMD, targeting OpenCL (and HSA). The main problem was that manual porting limits the size of the to-be-ported code-base.

Luckily there is a new tool in town. AMD now offers HIP, which converts over 95% of CUDA, such that it works on both AMD and NVIDIA hardware. The remaining 5% consists of resolving the ambiguities that arise when CUDA is used on non-NVIDIA GPUs. Once the CUDA code has been translated successfully, the software can run on both NVIDIA and AMD hardware without problems.

The target group of HIP is companies with older clusters who don’t want to pay the premium prices for NVIDIA’s latest offerings. Replacing a single server with 4 Tesla K20 GPUs of 3.5 TFLOPS by 3 dual-GPU FirePro S9300X2 GPUs of 11 TFLOPS will give a huge performance boost for a competitive price.

The cost of making CUDA work on AMD hardware is easily paid for by the price difference when upgrading a GPU cluster.

NVIDIA enables OpenCL 2.0 beta-support

In the release notes for the NVIDIA 378.66 graphics drivers for Windows, NVIDIA mentions support for OpenCL 2.0. This is the first time in the three years since OpenCL 2.0 was launched that they have publicly spoken about supporting it. Several 2.0 functions had silently been added to the driver on customer request, but these additions never got any reference in release notes and were therefore officially unofficial.

You should know that only on 3 April 2015 did NVIDIA finally start supporting OpenCL 1.2 on their GPUs based on Kepler and newer architectures. By then OpenCL 2.0 had already been available for one and a half years (since November 2013), which is now more than three years ago.

Does it mean that you will soon be able to run OpenCL 2.0 kernels on your newly bought Titan X? Yes and no. Read on to find out about the new advantages and the limitations of the beta-support.

Update: We tested NVIDIA drivers on Linux too. Read it here.

OpenCL vs CUDA Misconceptions


Translation available: Russian/Русский. (Let us know if you have translated this article too… And thank you!)


Last year I explained the main differences between CUDA and OpenCL. Now I want to get some old (and partly false) stories around CUDA-vs-OpenCL out of this world. While it has been claimed too often that one technique is just better, it should also be said that CUDA is better in some aspects, whereas OpenCL is better in others.

Why did I write this article? I think NVIDIA is visionary in both technology and marketing. But as I’ve written before, the potential market for dedicated graphics cards is shrinking, which forecasts the end of CUDA on the desktop. Not having this discussion opens the door for closed standards and delays innovation, which can happen on top of OpenCL. The sooner people and companies start choosing a standard that gives equal competitive advantages, the more we can expect from the upcoming hardware.

Let’s stand by what we learnt at school about gathering information sources: don’t put all your eggs in one basket! Gather as many sources and references as possible. Please also read articles which claim (and underpin!) why CUDA has a more promising future than OpenCL. If you can, post comments with links to articles you think others should read too. We appreciate contributions!

I also found that Google Insights agrees with what I constructed manually.

Nokia Maemo and OpenCL

Update 21-06-2011: Bumped into a project by Nokia: CLEP, “OpenCL Embedded Profile” for the N900.

Maemo is Nokia’s Debian-based Linux distribution for embedded devices. It runs on the N900, so you can be root on your own phone and compile your own kernel. In other words: a great developer’s phone.

Which smartphone to buy when you want to toy around with OpenCL “Embedded Profile”? There is more and more evidence that the next iPhone OS will have support for OpenCL, as should be expected with Apple being the trademark owner of OpenCL. This is good, since the mobile market could make the difference for the technique – competing with CUDA and DirectCompute. “The other ARM Cortex-A8 smartphone”, the Nokia N900, does not support it, while the magic of OpenCL attracts many developers on the Maemo forums.

The Qt blog that disclosed the coming OpenCL support for Qt spoke about it too:

>>Right now, QtOpenCL works very well with desktop OpenCL implementations, like that from NVIDIA (we’ve tested it under Linux, Mac, and Windows). Embedded devices are currently another matter – OpenCL implementations are still very basic in that space.  The performance improvements on embedded CPU’s are only slightly better than using ARM/NEON instructions for example.  And embedded GPU’s are usually hard-wired for GLSL/ES, lacking many of the features that makes OpenCL really sing.  But like everything in the embedded space, things are likely to change very quickly. By releasing QtOpenCL, hopefully we can stimulate the embedded vendors to accelerate development by giving them something to test with. Be the first embedded device on the block to get the mandelbrot demo running at 10fps, or 20fps, or 60fps!<<

But checking the whole Nokia Qt/Maemo SDK for something like “opencl.h”, or words like “opencl” and “khronos” in .h-files, did not return anything interesting. The missing reference in the SDK tells me we cannot expect any OpenCL implementation on the N900 soon. So do we have to wait for the Nokia N920, Maemo 6 and Qt 4.8? Once I know more, by getting deeper into the SDK, you’ll be the first to know. But first let me show you the documents which tell us OpenCL is coming to the Maemo platform.

The Maemo Base Port Document, version 1.1

Exhibit number 1. The introduction tells us that the document describes what hardware-designers should do to get Maemo working on their device:

>>When Maemo is ported to a new chipset and HW environment, the majority of the SW work takes place in the base layer. However, some adjustments may also be needed in the other layers. The porting work as a whole is a combined effort by the chipset vendor and Nokia. This document describes the deliverables expected from the chipset vendor in such an effort. The requirements in this document are expressed in the form of SW component, interface and functional requirements. Note that in many cases more detailed discussions are needed between Nokia and the chipset vendor to reach a common understanding about the specifics of the system architecture and the required component versions, functionality and interfaces.<<

So the document describes what the hardware must support, to be able to run Maemo. Let’s then find the magic word “OpenCL”:

>>Graphics Adaptation. The Base Port graphics adaptation interfaces consist of X11, OpenGL ES, and OpenVG interfaces. The OpenCL interface is also included in this group since it typically is used to access the GPU for general-purpose parallel computation.<<

And somewhat below:

>>OpenCL 1. The Base Port should provide an implementation of the OpenCL 1.0 interface for general-purpose parallel programming of heterogeneous systems, especially for the use of GPUs for computation (Khronos group standard).<<

That makes it pretty clear that Maemo devices must be able to support OpenCL.

http://www.forum.nokia.com/piazza/wiki/images/7/7d/Maemo_Base_Port_v1.1.pdf

Paper “OpenCL on Embedded devices” by Nokia

Exhibit 2 shows tests of a few simple OpenCL programs on an unnamed device with a TI OMAP 3430 (550 MHz ARM Cortex-A8 CPU & 110 MHz POWERVR SGX530 GPU) – which happens to be in the Motorola Droid, Palm Pre, and Nokia N900. So they managed to create an OpenCL implementation on ARM. If you’re interested in OpenCL for embedded devices, please do read this presentation:

http://www.khronos.org/developers/library/2009-hotchips/Nokia_OpenCL-in-Handheld-Devices.pdf

It is a document from August 2009, which shows they were actually trying POWERVR and OpenCL back then. Now with Qt and Maemo mentioning it, we can be very sure the N900 or the N920 is eventually going to have OpenCL support.

Accelerating an Excel Sheet with OpenCL

One of the world’s most used pieces of software is far from performance-optimised, and there is hardly anything we can do about it. I’m talking about Excel.

There are various engine replacements which promise higher speeds, but those have the disadvantage that they’re still not fast enough for really heavy calculations. Another option is to use the much faster LibreOffice, but companies prefer ribbons over new software. The last option is to offer performance-optimised modules for the problematic parts. We created a demo a few years ago and revived it recently.

video: OpenCL on Android

Michael Leahy spoke at AnDevCon’13 about OpenCL on Android. Enjoy the overview!

Subjects (globally):

  • What is OpenCL
  • 13 dwarfs
  • RenderScript
  • Demo

http://www.youtube.com/watch?v=XQCYWmYCJWo

Mr. Leahy is quite critical about Google’s recent decisions to try to block OpenCL in favour of their own proprietary RenderScript Compute (now mostly referred to as just “RenderScript”, as they failed at pushing its twin “RenderScript Graphics”, now replaced with OpenGL).

Around March ’13 I submitted a proposal to speak about OpenCL on Android at AnDevCon in November shortly after the “hidden” OpenCL driver was found on the N4 / N10. This was the first time I covered this material, so I didn’t have a complete idea on how long it would take, but the AnDevCon limit was ~70 mins. This talk was supposed to be 50 minutes, but I spoke for 80 minutes. Since this was the last presentation of the conference and those in attendance were interested enough in the material I was lucky to captivate the audience that long!

I was a little concerned about taking a critical opinion toward Google given how many folks think they can create nothing but gold. Afterward I recall some folks from the audience mentioning I bashed Google a bit, but this really is justified in the case of suppression of OpenCL, a widely supported open standard, on Android. In particular last week I eventually got into a little discussion on G+ with Stephen Hines of the Renderscript team who is behind most of the FUD being publicly spread by Google regarding OpenCL. One can see that this misinformation continues to be spread toward the end of this recent G+ post where he commented and then failed to follow up after I posted my perspective: https://plus.google.com/+MichaelLeahy/posts/2p9msM8qzJm

And that’s how I got in contact with Michael: we are both irritated by Google’s actions against our favourite open standards. Microsoft learned long ago that you should not block, only favour. But Google lacks the experience and believes they’re above the rules of survival.

Apparently he can dish out FUD, but can’t be bothered to answer challenges to the misinformation presented. Mr. Hines is also the one behind shutting down commentary on the Android issue tracker regarding the larger developer community’s ability to express their interest in OpenCL on Android.

Regarding a correction: during the presentation, given the information available at the time, I mentioned that Renderscript is using OpenCL for GPU compute aspects. This was true for the Nexus 4 and 10 on Android 4.2 and likely 4.3, in particular the Nexus 10 using the Mali GPU from ARM. Since then Google has been getting various GPU manufacturers to make a Renderscript driver that doesn’t utilize OpenCL for GPU compute aspects.

I hope you like the video and also understand why it remains important that we keep the discussion on Google + OpenCL active. We must remain focused on the long term and not simply accept what others decide for us.

Does GPGPU have a bright future?

This post is aimed at programmers. The main question: “should I invest in learning CUDA/OpenCL?”

Using the video processor for parallel processing has actually been possible since the beginning of 2006; you just had to know how to use the OpenGL Shading Language. Not long after that (end of 2006) CUDA was introduced. A lot has happened since, which resulted in the introduction of OpenCL in the fall of 2008. But the acceptance of OpenCL is actually pretty low. Many companies which do use it want to keep it as their own advantage and don’t tell the competition that they just saved hundreds of thousands of Euros/Dollars, because they could replace their compute cluster with a single computer which cost them €10 000,- and a rewrite of the calculation core of their software. Has it become a secret weapon?

This year a lot of effort will be put into integrating OpenCL into the existing programming languages (without all the thousands of tweak options visible). Think of wizards around pre-built kernels and libraries. Next year everything will be about kernel development (kernels are the programs which do the actual calculations on the graphics processor). The year after that, the peak is over and nobody knows it is built into their OS or programming language. It’s just like how current programmers use security protocols, but don’t know what they actually are.

If I want to slide to the next page on a modern mobile phone, I just make a call to a slide function. A lot happens when the function is called, such as building up the next page in a separate part of memory, calling the GPU functions to show the slide, and possibly unloading the previous page. The same goes for OpenCL: I want to calculate an FFT with a specified precision and I don’t want to care on which device the calculation is done. The advantage of building blocks (like LEGO) is that we keep the focus of development on the end target, while we can tweak it later (if the customer has paid for this extra time). What’s a bright future if nobody knows it?

Has it become a secret weapon?

Yes and no. Companies want to brag about their achievements, but don’t want their competitors to go the same way and don’t want their customers to demand lower prices. AMD and NVidia are pushing OpenCL/CUDA, so it won’t stop growing in the market, but actually this pushing accounts for most of the growth in the market. NVidia does a good job marketing their CUDA platform.

What’s a bright future if nobody knows it?

Everything that has market-wide acceptance has a bright future. It might be replaced by a successor, but acceptance is the key. With acceptance there will always be a demand for (specialised) kernels to be integrated in building blocks.

We also have the new processors with 32+ cores, which actually need to be used; you know the problem with dual-core “support”.

Also the mobile market is growing rapidly. Once that is opened for OpenCL, there will be a huge growth in demand for accelerated software.

My advice: if high performance is very important for your current or future tasks, invest in learning how to write kernels (CUDA or OpenCL, whatever your favourite is). Use wrapper libraries which make it easy for you, because by the time you have learned how to use the raw OpenCL calls, they will be completely integrated in your favourite programming language.

4-day training on OpenCL-on-FPGAs, 24-28 October, Amsterdam

From 24 to 28 October we give a 4-day training on OpenCL-on-FPGAs using Altera hardware. The learning goals are writing correct OpenCL code for FPGAs, learning to work with Quartus, and understanding the important optimisation techniques.

The total costs are €2760 excluding VAT for the whole week (2 + 2 days of training, with one day off in between), including a tour of Amsterdam on Wednesday.

See the special event-page for more information.

Brochures

We have two brochures. One general and one for training. You can also generate PDFs from most pages on this website to support off-line discussion.

New versions will arrive soon – unfortunately there have been quite some delays.

[columns]
[one_half title=”Training”]

In the brochure for trainings (version January 2012) all training modules are written out.

Download our trainings brochure here.

[/one_half]
[one_half title=”General”]

In the general brochure (version May 2011) we explain both our consultancy and training services.

Download our general brochure here.

[/one_half]
[/columns]

StreamHPC.Brochure.2010-01.EN.pdf

Opinions crossing the table: Khronos for world peace

Pragmas are not mentioned in this old image explaining how languages stack up.

At SC16 there was a discussion between programming language standards for heterogeneous hardware, organised by Khronos. See here for the setup of the session. It was expected to be a heated discussion, but in the end it was a good conversation with lots of learning.

The main message from each language seems to be: “Yes, we’re working on that feature”. This means that a programming language is just like a human language, as new things get named and described world-wide. This also shows the hard work that the development of languages brings, as new feature requests are a constant.

OpenCL Basics: Running multiple kernels in OpenCL

This series “Basic concepts” is based on GPGPU-questions we get via email more than once, or when the question is not clearly explained in the books. For one it is obvious, for the other just what they’re missing.

They say that learning a new technique is best done by playing around with working code and then trying to combine it. The idea is that when you have Stackoverflowed and Githubed code together, you’ve created so many bugs by design that you’ll learn a lot if you make it work. When applying this to OpenCL, you quickly get to a situation where you want to run one.cl file and then another.cl file. Almost all beginner’s material discusses a single OpenCL file, so how to do this elegantly?

Basic Concepts: online kernel compiling

Typos are a programmer’s worst nightmare, as they are bad for concentration. The code in your head is not the same as the code on the screen, and fixing that doesn’t have much to do with the actual problem solving. Code highlighting in the IDE helps, but it is better to use the actual OpenCL compiler without running your whole software: an Online OpenCL Compiler. In short, it is just an OpenCL program with a variable kernel as input, which uses the compilers of Intel, AMD, NVidia or whatever you have installed to try to compile the source. I have found two solutions, which both have to be built from source – so a C compiler is needed.

  • CLCC. It needs the boost-libraries, cmake and make to build. Works on Windows, OSX and Linux (needs possibly some fixes, see below).
  • OnlineCLC. Needs waf to build. Seems to be Linux-only.

OpenCL potentials: Watermarked media for content-protection

HTML5 has the future, now that Flash and Silverlight are abandoning the market to clear the way for HTML5 video. There is one big problem: it is hard to protect the content – before you know it, the movie is on the free market. DRM is only a temporary solution and many times ends in frustration for users who just want to see the movie wherever they want.

If you look at e-books, you see a much better way to make sure PDFs don’t get all over the web: personalizing. With images and videos this could be done too. The example here at the right has a very obvious, clearly visible watermark (source), but there are many methods which are not easy to see – and thus easier to miss by people who want to clean the file. It therefore has a clear advantage over DRM, where it is obvious what has to be removed. Watermarks give the buyers freedom of use. The only disadvantage is that ownership of a personalised video cannot be transferred.

OpenCL SPIR by example

OpenCL SPIR (Standard Portable Intermediate Representation) is an intermediate representation for OpenCL code, comparable to LLVM IR and HSAIL. It is a search for what would be a good representation, such that parallel software runs well on all kinds of accelerators. LLVM IR is too general, but SPIR is a subset of it. I’ll discuss HSAIL where it differs from SPIR – I thought SPIR was a better way to start introducing these. In my next article I’d like to give you an overview of the whole ecosphere around OpenCL (including SPIR and HSAIL), to give you an understanding of what it all means, where we’re going, and why.

Know that the new SPIR-V is something completely different in implementation, and we are only discussing the old SPIR here.

Contributors to the SPIR specifications are: Intel, AMD, Altera, ARM, Apple, Broadcom, Codeplay, Nvidia, Qualcomm and Xilinx. Boaz Ouriel of Intel is the pen-holder of the specifications, and to no surprise Intel has had the first SPIR compiler. I am happy to see Nvidia is in the committee too, and hope they don’t just take ideas for CUDA from this collaboration but finally join. Broadcom and Xilinx are new, so we can expect stuff from them.

For now, just see what SPIR is – as it can help us understand how the compiler works and write better OpenCL code. I used Intel’s offline OpenCL compiler to compile the kernel below to SPIR, which can be done on the command line with: ioc64 -cmd=build -input=sum.cl -llvm-spir32=sum.ll (you need an Intel CPU to use the compiler).

[raw]

__kernel void sum(const int size, __global float * vec1, __global float * vec2){
  int ii = get_global_id(0);

  if(ii < size) vec2[ii] += vec1[ii];

}

[/raw]

There are two variations for generating SPIR code: binary SPIR and LLVM-SPIR (both in 32 and 64 bit versions). As you might understand, the binary form is not really readable, but SPIR described in the LLVM IR language luckily is. Run ioc64 without parameters to see more options (Assembly, pure LLVM, Intermediate Binary).

Performance of 5 accelerators in 4 images

If there were one rule for getting the best performance, it would be: avoid data transfers. Therefore it’s important to have lots of bandwidth and GFLOPS per processor, and not to simply add up those numbers. Everybody who has worked with MPI knows why: transferring data between processors can totally kill the performance. So the more is packed in one chip, the better the results.

In this short article, I would like to quickly give you an overview of the current state of bandwidth and performance. You would think the current generation of accelerators is very close together, but actually it is not.

The devices in the images below are the AMD FirePro S9150 (16GB), NVidia Tesla K80 (1 GPU of the 2, 12GB), NVidia Tesla K40 (12GB), Intel XeonPhi 7120P (16GB) and Intel Xeon 2699 v3 (18 core CPU). I hesitated between the K40 and the K80, as I wanted to focus on a single GPU only – so I took both. Dual-GPU cards have an advantage when it comes to power consumption and physical space – neither is taken into consideration in this blog. Efficiency (actual performance compared to theoretical maximum) is not included either, as this also needs a broad explanation.

Each of these accelerators runs on X86-OpenMP and OpenCL

The numbers

The bandwidth and performance show where things stand: The XeonPhi and FirePro have the most bandwidth, and the FirePro is a staggering 70% to 100% faster than the rest on double precision GFLOPS.

Bandwidth per chip: Xeon Phi gets to 350 GB/s, followed by the FirePro with 320 GB/s and the K40 with 288 GB/s. NVidia’s K80 is only at 240 GB/s, whereas DDR gets only 50-60 GB/s.

GFLOPS per chip: The FirePro leaves the competition far behind with 2530 GFLOPS (Double Precision). The K40 and K80 get 1430 and 1450, followed by the CPU at 1324 and the Xeon Phi at 1208. Notice these are theoretical maximums and will be lower in real-world applications.

If you have OpenCL or OpenMP code, you can optimise your code for a new device in a short time. Yes, you should have written it in OpenCL or OpenMP, as now the competition can easily outperform you by selecting a better device.

Costs

Lowest prices in the Netherlands, at the moment of writing:

  • Intel Xeon 2699 v3: € 6,560
  • Intel Xeon Phi 7120P + 16GB DDR4: € 3,350
  • NVidia Tesla K80: € 5,500 (€ 2,750 per GPU)
  • NVidia Tesla K40: € 4,070
  • AMD FirePro S9150: € 3,500

Some prices (like the K40) have one shop with a low price, where others are at least €200 more expensive.

Note: as the Xeon can have 1TB of memory, the “costs per GB/s” is only half the story. Currently the accelerators only have 16GB. Soon a 32GB FirePro, the S9170, will be available in the shops, to move up in this space of memory-hungry HPC applications.

Costs per GB/s: where the four accelerators are around €11 per GB/s, the Xeon takes €131 (see note above). Note that the K40 with €14.13 is expensive compared to the other accelerators.

Costs per GFLOPS per chip: for raw GFLOPS the FirePro is the cheapest, followed by the K80, XeonPhi and then the K40. While the XeonPhi and K40 are twice as expensive as the FirePro, the Xeon is clearly the most expensive, as it is 3.5 times as expensive as the FirePro.

If costs are an issue, then it really makes sense to invest some time in making your own Excel sheets for several devices and include costs for power usage.

Which to choose?

Based on the above numbers, the FirePro is the best choice. But your algorithm might simply work better on one of the others – we can help you by optimising your code and performing meaningful benchmarks.

Company History

There are not many companies like Stream HPC in Europe. Most others are either a government institute for the national supercomputer, freelancers, or not actually experienced with GPUs. So how did it start?

2009: The bore-out

Stream’s founder Vincent Hindriksen had to maintain a piece of software that was often failing to process the daily reports. After documenting the internals and algorithms of the code by interviewing the key people and some reverse engineering, it was a lot easier to create effective solutions for the bugs within the software. After fixing a handful of bugs, there was simply a lot less to do except reading books and playing online games.

To avoid becoming a master in Sudoku, he spent the following three weeks rewriting all the code, using the freshly produced documentation. The 2.5 hours needed to process the data were reduced to 19 seconds – yes, the kick for performance optimization was already there. For some reason it took well over 6 months to port the proof-of-concept, which was simply unbearable, as somebody had to make sure the old code was maintained for 40 hours a week. As he was the only one who understood the code, there was no option to get placed on another project.

This ended in a bore-out: not wanting to go to work anymore. It’s actually quite the same as a burn-out, but with a different cause.

2010: a new start

From that bore-out the company was born the next year. There were two options: GPGPU (mostly OpenCL, a hobby) or building smart products for public transport. Two domains were bought, and the choice was made during the year. For the public transport a proof-of-concept was made, but the choice fell on the really difficult work.

Not much money was earned that year. Even government-support had to be paid back as one invoice was sent 2 weeks too early.

2011-2013: What’s a GPU?

We now have a clear idea of what GPUs can do, but in 2010-2014 GPUs were still for graphics only and sales were very difficult. Selling to somebody who states “GGGGGGraphics Processing Unit” is quite difficult.

A loan of €4000 by Vincent’s grandmother, a landlord who was relaxed with payments, the trust in the technology by early customers, and late payments to our creditors got us through.

2014: Employee #1

There was still no stable income. Sales & marketing also took a lot of time, cutting into the time that could be spent on actual work. But slowly we got more traction – more people started to believe in the company’s vision.

But by the end of the year the first employee was hired, Anca.

As the choice was to build a services company instead of a products company, banks and investors were not even interested in providing financial support. We can now say that for the long term this was the best – we can now fully control our own strategies and invest in our own product development.

2015-2020: First growth phase

Growth is hard, really hard. And we learned that, well, the hard way. Several decisions would now be made differently, but we adapted and continued. Some examples:

  • Investing in FPGAs too early. OpenCL-on-FPGAs was the next big thing, so based on what we got promised by vendors, we made the same promises to our customers. Many promises did not turn into reality.
  • Hiring the wrong people. Or: hiring people for whom we are the wrong company, as it goes both ways. We now define our culture, because we want people who fit our culture.
  • All the other things that are in the books under “early stage growth”.

By 2021 we got past the growing pains and went into the second phase.

2021: The second office

The first choice was actually in Belgium, because it was closer to Amsterdam. Unfortunately that project did not succeed. By coincidence we got into Budapest, and grew out of the office space the first year.

Basic concepts: malloc in the kernel

Pointers and allocated memory space with a hint to Oktoberfest.

During the last training I got a question on how to do malloc in the kernel. It was one of those good questions, as it gives another view on a basic concept of OpenCL. Simply put: you cannot allocate (local or global) memory from within the kernel. Luckily there is a way to get the same result, but it is somewhat hidden in another function.

clSetKernelArg to the rescue

The way to do it is from the host, using one of the kernel arguments.

cl_int clSetKernelArg ( cl_kernel kernel,
cl_uint arg_index,
size_t arg_size,
const void *arg_value)

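For a __local pointer argument, this function makes the runtime reserve the memory on the device for you: pass the number of bytes as arg_size and NULL as arg_value. Global memory is created on the host with clCreateBuffer, and the resulting buffer is handed to the kernel with this same function. Just as with normal malloc, the memory is not cleared for you.

To make sure the host cannot access a global buffer (and you don’t accidentally pin/write/read it, when using host-generation scripts), you can create it with a flag for that: CL_MEM_HOST_NO_ACCESS. All the flags have been explained in a previous article about setting flags for creating kernel arguments.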

The advantage of only allowing malloc to be done from the host, before the kernel is launched, is that the memory-planning can be done more efficiently.

Local memories

When you need a local space, you can specify that at the kernel-side. For example:

__kernel void foo(__local int* bar) { ... }

On the host, set this argument with clSetKernelArg, passing NULL as arg_value. This mallocs an area of arg_size bytes in the local memory of every work-group.
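A minimal host-side sketch of both cases could look as follows. It is an illustration under assumptions, not a reference implementation: the kernel body, the sizes and the second argument `out` are made up, error handling is reduced to early exits, and it needs an installed OpenCL 1.2 (or newer) runtime and linking against the OpenCL library (on macOS the header is OpenCL/opencl.h instead).

```c
#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <CL/cl.h>

/* Hypothetical kernel: 'bar' lives in local memory, 'out' in global memory. */
static const char *src =
    "__kernel void foo(__local int *bar, __global int *out) {\n"
    "    size_t lid = get_local_id(0);\n"
    "    bar[lid] = (int)lid;\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    out[get_global_id(0)] = bar[lid];\n"
    "}\n";

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS ||
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL) != CL_SUCCESS) {
        fprintf(stderr, "no OpenCL device found\n");
        return 1;
    }
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    if (err != CL_SUCCESS) return 1;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    if (err != CL_SUCCESS ||
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL) != CL_SUCCESS) return 1;
    cl_kernel kernel = clCreateKernel(prog, "foo", &err);
    if (err != CL_SUCCESS) return 1;

    const size_t local_items = 64, global_items = 1024;  /* assumed sizes */

    /* Global memory: created by clCreateBuffer, contents uninitialised.
     * CL_MEM_HOST_NO_ACCESS tells the runtime the host never reads or writes it. */
    cl_mem out = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_HOST_NO_ACCESS,
                                global_items * sizeof(cl_int), NULL, &err);
    if (err != CL_SUCCESS) return 1;

    /* Local memory: arg_value is NULL, arg_size is the number of bytes
     * each work-group gets for 'bar'. */
    err  = clSetKernelArg(kernel, 0, local_items * sizeof(cl_int), NULL);
    /* Global memory: pass the buffer handle itself. */
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &out);
    printf("kernel arguments set: %s\n", err == CL_SUCCESS ? "ok" : "failed");

    clReleaseMemObject(out);
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return err == CL_SUCCESS ? 0 : 1;
}
```

The sketch stops after setting the arguments; enqueueing the kernel with a work-group size of local_items would be the next step.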

Basic Concepts

This short article is in the basic concept series. It contains several subjects I did not see well-enough explained in books or the reference manual. If you see a subject that you would like to see in this series, just contact us.

Supporting OpenCL on your own hardware

Say you have a device which is extremely good at numerical trigonometry (including integrals, transformations, etc. to support mainly Fourier transforms) by using massive parallelism. You also have an optimised library which takes care of the transfer to the device and the handling of trigonometric math.

Then you find out that the strength of your company is not the device alone, but also the powerful and easy-to-use library. You also find out that companies are willing to pay for the library, if it would work with other devices too. From your own helpdesk you hear that most questions are about extending the library with specialised functions. Given this information, you define new customer groups for device-only and library-only – so just by adopting a standard you can increase revenue. Read below which steps you have to take to adopt OpenCL.

IWOCL 2017 Toronto call for talks and posters is open

The fifth International Workshop on OpenCL (IWOCL) will be held on 16-18 May 2017 in Toronto, Canada. The event kicks off with a full-day Advanced Hands-On OpenCL tutorial, which is followed by two days of conference: keynotes, academic papers, technical presentations, tutorials, poster sessions and table-top demonstrations.

IWOCL 2017 Call for Submission Now Open – Submit your abstract here. Deadline is beginning of February, so better submit the coming month!

Call for IWOCL 2017 Annual Sponsors is also open. For that contact the IWOCL organisation via this webform.

Every year there have been unique conversations having real influence on the OpenCL standard, and we heard real-life development experience during various talks. If you missed the real technical talks at certain other GPU conferences, then IWOCL is where you should go.