NVIDIA enables OpenCL 2.0 beta-support

Posted by Jakub Szuppe on 22 February 2017

In the release notes for NVIDIA 378.66 graphics drivers for Windows NVIDIA mentions support for OpenCL 2.0. This has been the first time in 3 years since OpenCL 2.0 has been launched, that they publicly speak about supporting it. Several 2.0 functions had silently been added to the driver on customer request, but these additions never got any reference in release notes and were therefore officially unofficial.

You should know that only on 3 April 2015 NVIDIA finally started supporting OpenCL 1.2 on their GPUs based on Kepler and newer architectures. OpenCL 2.0 was already there for one and a half years (November 2013), now more than three years ago.

Does it mean that you will be soon able to run OpenCL 2.0 kernels on your newly bought Titan X? Yes and no. Read on to find out about the new advantages and the limitations of the beta-support.

Update: We tested NVIDIA drivers on Linux too. Read it here.

Continue reading “NVIDIA enables OpenCL 2.0 beta-support” →

PDFs of Monday 16 April

Posted by Vincent Hindriksen on 16 April 2012

By exception, another PDF-Monday.

OpenCL vs. OpenMP: A Programmability Debate. The one moment OpenCL and the other mom ent OpenMP produces faster code. From the conclusion: “OpenMP is more productive, while OpenCL is portable for a larger class of devices. Performance-wise, we have found a large variety of ratios between the two solutions, depending on the application, dataset sizes, compilers, and architectures.”

Improving Performance of OpenCL on CPUs. Focusing on how to optimise OpenCL. From the abstract: “First, we present a static analysis and an accompanying optimization to exclude code regions from control-flow to data-flow conversion, which is the commonly used technique to leverage vector instruction sets. Second, we present a novel technique to implement barrier synchronization.”

Variants of Mersenne Twister Suitable for Graphic Processors. Source-code at http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MTGP/

Accelerating the FFTD method using SSE and GPUs. “The Finite-Difference Time-Domain (FDTD) method is a computational technique for modelling the behaviour of electromagnetic waves in 3D space”. This is a project-plan, but describes the theories pretty well. Continue reading “PDFs of Monday 16 April” →

Nokia Maemo and OpenCL

Posted by Vincent Hindriksen on 1 June 2010 with 2 Comments

Update 21-06-2011: Bumped into a project by Nokia: CLEP, “OpenCL Embedded Profile” for the N900.

Maemo is the Debian based Linux-distribution of Nokia for embedded devices. It is on the gadget N900, so you can be root on your own phone and compile your own kernel. In other words: a great developer’s phone.

Which smartphone to buy when you want to toy around with OpenCL “Embedded Profile”? There is more and more evidence that the next iPhone OS will have support for OpenCL, as should be expected Apple being the trademark-owner of OpenCL. This is good, since the mobile market could make the difference for the technique – competing with CUDA and DirectCompute. “The other ARM Cortex-A8 smartphone”, the Nokia N900 does not support it, while the magic of OpenCL attracts to many developers on the Maemo-forums.

The QT-blog that disclosed coming OpenCL-support for QT, spoke about it too:

>>Right now, QtOpenCL works very well with desktop OpenCL implementations, like that from NVIDIA (we’ve tested it under Linux, Mac, and Windows). Embedded devices are currently another matter – OpenCL implementations are still very basic in that space. The performance improvements on embedded CPU’s are only slightly better than using ARM/NEON instructions for example. And embedded GPU’s are usually hard-wired for GLSL/ES, lacking many of the features that makes OpenCL really sing. But like everything in the embedded space, things are likely to change very quickly. By releasing QtOpenCL, hopefully we can stimulate the embedded vendors to accelerate development by giving them something to test with. Be the first embedded device on the block to get the mandelbrot demo running at 10fps, or 20fps, or 60fps!<<

But checking the whole Nokia QT/Maemo-SDK for something like “opencl.h” or words like “opencl” and “khronos” in .h-files did not return anything interesting. The missing reference in the SDK tells me, we cannot expect any OpenCL-implementation on the N900 soon. So do we have to wait for the Nokia N920, Maemo 6 and QT 4.8? Once I know more, by getting deeper into the SDK, you’re the first to know. But first let me show you the documents which tells us OpenCL is coming to the Maemo-platform.

The Maemo Base Port Document, version 1.1

Exhibit number 1. The introduction tells us that the document describes what hardware-designers should do to get Maemo working on their device:

>>When Maemo is ported to a new chipset and HW environment, the majority of the SW worktakes place in the base layer. However, some adjustments may also be needed in the otherlayers. The porting work as a whole is a combined effort by the chipset vendor and Nokia. Thisdocument describes the deliverables expected from the chipset vendor in such an effort. The requirements in this document are expressed in the form of SW component, interface andfunctional requirements. Note that in many cases more detailed discussions are neededbetween Nokia and the chipset vendor to reach a common understanding about the specificsof the system architecture and the required component versions, functionality and interfaces.<<

So the document describes what the hardware must support, to be able to run Maemo. Let’s then find the magic word “OpenCL”:

>>Graphics Adaptation. The Base Port graphics adaptation interfaces consist of X11, OpenGL ES, and OpenVG interfaces. The OpenCL interface is also included in this group since it typically is used to access the GPU for general-purpose parallel computation.<<

And somewhat below:

>>OpenCL 1. The Base Port should provide an implementation of the OpenCL 1.0 interface for general-purpose parallel programming of heterogeneous systems, especially for the use of GPUs for computation (Khronos group standard).<<

That seems to be pretty clear that Maemo-devices must be able to support OpenCL.

http://www.forum.nokia.com/piazza/wiki/images/7/7d/Maemo_Base_Port_v1.1.pdf

Paper “OpenCL on Embedded devices” by Nokia

Exhibit 2 shows tests of a few simple OpenCL-program on an unnamed device with a TI OMAP 3430 (550 MHz ARM Cortex-A8 CPU & 110 MHz POWERVR SGX530 GPU) – which happens to be in the Motorola Droid, Palm Pre, and Nokia N900. So they managed to create a OpenCL-implementation on ARM. If you’re interested in OpenCL for embedded devices, please do read this presentation:

http://www.khronos.org/developers/library/2009-hotchips/Nokia_OpenCL-in-Handheld-Devices.pdf

It is a document from august 2009, which shows they actually were trying POWERVR and OpenCL then. Now with QT and Maemo mentioning it, we can be very sure the N900 or the N920 is eventually going to have OpenCL-support.

OpenCL vs CUDA Misconceptions

Posted by Vincent Hindriksen on 22 June 2011 with 47 Comments

Translation available: Russian/Русский. (Let us know if you have translated this article too… And thank you!)

Last year I explained the main differences between CUDA and OpenCL. Now I want to get some old (and partly) false stories around CUDA-vs-OpenCL out of this world. While it has been claimed too often that one technique is just better, it should be also said that CUDA is better in some aspects, whereas OpenCL is better in others.

Why did I write this article? I think NVIDIA is visionary in both technology and marketing. But as I’ve written before, the potential market for dedicated graphics cards is shrinking and therefore forecasting the end of CUDA on desktop. Not having this discussion opens the door for closed standards and delaying innovation, which can happen on top of OpenCL. The sooner people & companies start choosing for a standard that gives equal competitive advantages, the more we can expect from the upcoming hardware.

Let’s stand by what we have learnt at school when gathering information sources, don’t put all your eggs in one basket! Gather as many sources and references as possible. Please also read articles which claim (and underpin!) why CUDA has a more promising future than OpenCL. If you can, post comments with links to articles you think others should read too. We appreciate contributions!

Also found that Google Insights agrees with what I constructed manually.

Continue reading “OpenCL vs CUDA Misconceptions” →

Accelerating an Excel Sheet with OpenCL

Posted by Vincent Hindriksen on 19 September 2016

excel-opencl One of the world’s most used software is far from performance optimised and there is hardly anything we can do about it. I’m talking about Excel.

There are various engine replacements which promise higher speeds, but those have the disadvantage that they’re still not fast enough with really heavy calculations. Another option is to use much faster LibreOffice, but companies prefer ribbons over new software. The last option is to offer performance-optimised modules for the problematic parts. We created a demo a few years ago and revived it recently. Continue reading “Accelerating an Excel Sheet with OpenCL” →

video: OpenCL on Android

Posted by Vincent Hindriksen on 11 February 2014 with 1 Comment

Michael Leahy spoke on AnDevCon’13 about OpenCL on Android. Enjoy the overview!

Subjects (globally):

What is OpenCL
13 dwarfs
RenderScript
Demo

http://www.youtube.com/watch?v=XQCYWmYCJWo

Mr.Leahy is quite critical about Google’s recent decisions to try to block OpenCL in favour of their own proprietary RenderScript Compute (now mostly referred to as just “RenderScript” as they failed on pushing twin “RenderScript Graphics”, now replaced with OpenGL).

Around March ’13 I submitted a proposal to speak about OpenCL on Android at AnDevCon in November shortly after the “hidden” OpenCL driver was found on the N4 / N10. This was the first time I covered this material, so I didn’t have a complete idea on how long it would take, but the AnDevCon limit was ~70 mins. This talk was supposed to be 50 minutes, but I spoke for 80 minutes. Since this was the last presentation of the conference and those in attendance were interested enough in the material I was lucky to captivate the audience that long!

I was a little concerned about taking a critical opinion toward Google given how many folks think they can create nothing but gold. Afterward I recall some folks from the audience mentioning I bashed Google a bit, but this really is justified in the case of suppression of OpenCL, a widely supported open standard, on Android. In particular last week I eventually got into a little discussion on G+ with Stephen Hines of the Renderscript team who is behind most of the FUD being publicly spread by Google regarding OpenCL. One can see that this misinformation continues to be spread toward the end of this recent G+ post where he commented and then failed to follow up after I posted my perspective: https://plus.google.com/+MichaelLeahy/posts/2p9msM8qzJm

And that’s how I got in contact with Micheal: we both are irritated by Google’s actions against our favourite open standards. Microsoft has long learned that you should not block, only favour. But Google lacks the experience and believes they’re above the rules of survival.

Apparently he can dish out FUD, but can’t be bothered to answer challenges to the misinformation presented. Mr. Hines is also the one behind shutting down commentary on the Android issue tracker regarding the larger developer communities ability to express their interest in OpenCL on Android.

Regarding a correction. At the time of the presentation given the information at the time I mentioned that Renderscript is using OpenCL for GPU compute aspects. This was true for the Nexus 4 and 10 for Android 4.2 and likely 4.3; in particular the Nexus 10 using the Mali GPU from Arm. The N4 & N10 were initially using OpenCL for GPU compute aspects for Renderscript. Since then Google has been getting various GPU manufacturers to make a Renderscript driver that doesn’t utilize OpenCL for GPU compute aspects.

I hope you like the video and also understand why it remains important we keep the discussion on Google + OpenCL active. We must remain focused on the long-term and not simply accept on what others decide for us.

Basic Concepts: online kernel compiling

Posted by Vincent Hindriksen on 28 October 2011 with 1 Comment

Typos are a programmers worst nightmare, as they are bad for concentration. The code in your head is not the same as the code on the screen and therefore doesn’t have much to do with the actual problem solving. Code highlighting in the IDE helps, but better is to use the actual OpenCL compiler without running your whole software: an Online OpenCL Compiler. In short is just an OpenCL-program with a variable kernel as input, and thus uses the compilers of Intel, AMD, NVidia or whatever you have installed to try to compile the source. I have found two solutions, which both have to be built from source – so a C-compiler is needed.

CLCC. It needs the boost-libraries, cmake and make to build. Works on Windows, OSX and Linux (needs possibly some fixes, see below).
OnlineCLC. Needs waf to build. Seems to be Linux-only.

Continue reading “Basic Concepts: online kernel compiling” →

4-day training on OpenCL-on-FPGAs, 24-28 October, Amsterdam

Posted by Vincent Hindriksen on 18 August 2016

fast-fpga From 24 to 28 October we give a 4-day training on OpenCL-on-FPGAs using Altera hardware. The learning goals are correctly writing OpenCL code for FPGAs, learning to work with Quartus and understanding the important optimisation techniques.

The total costs are €2760 excluding VAT for the whole week ( 2 + 2 days of training, one pause day), including a tour in Amsterdam on Wednesday.

See the special event-page for more information.

Does GPGPU have a bright future?

Posted by Vincent Hindriksen on 15 April 2010 with 4 Comments

This post has a focus towards programmers. The main question “should I invest in learning CUDA/OpenCL?”

Using the video-processor for parallel processing is actually possible since beginning 2006; you just had to know how to use the OpenGL Shader Language. Not long after that (end 2006) CUDA was introduced. A lot has happened after that, which resulted in the introduction of OpenCL in fall 2008. But actually the acceptance of OpenCL is pretty low. Many companies which do use it, want to have it as their own advantage and don’t tell the competition they just saved hundreds of thousands of Euros/Dollars because they could replace their compute-cluster with a single computer which cost them €10 000,- and a rewrite of the calculation-core of their software. Has it become a secret weapon?

This year a lot of effort will be put to integrate OpenCL within the existing programming languages (without all the thousands of tweak-options visible). Think about wizards around pre-built kernels and libraries. Next year everything will be around kernel-development (kernels are the programs which do the actual calculations on the graphics processor). The year after that, the peak is over and nobody knows it is built in their OS or programming-language. It’s just like current programmers use security-protocols, but don’t know what it actually is.

If I want to slide to the next page on modern mobile phones, I just make a call to a slide-function. A lot is happening when the function is called, such building up the next page in a separate part of memory, calling the GPU-functions to show the slide, possibly unloading the previous page. The same is with OpenCL; I want to calculate a FFT with specified precision and I don’t want to care on which device the calculation is done. The advantage of building blocks (like LEGO) is that we keeps the focus of development on the end-target, while we can tweak it later (if the customer has paid for this extra time). What’s a bright future if nobody knows it?

Has it become a secret weapon?

Yes and no. Companies want to brass about their achievements, but don’t want the competitors to go the same way and don’t want their customers to demand lower prices. AMD and NVidia are pushing OpenCL/CUDA, so it won’t stop growing in the market, but actually this pushing is the biggest growth in the market. NVidia does a good job with marketing their CUDA-platform.

What’s a bright future if nobody knows it?

Everything that has market-wide acceptation has a bright future. It might be replaced by a successor, but acceptance is the key. With acceptance there always will be a demand for (specialised) kernels to be integrated in building blocks.

We also have the new processors with 32+ cores, which actually need to be used; you know the problem with dual-core “support”.

Also the mobile market is growing rapidly. Once that is opened for OpenCL, there will be a huge growth in demand for accelerated software.

My advise: if high performance is very important for your current or future tasks, invest in learning how to write kernels (CUDA or OpenCL, whatever your favourite is). Use wrapper-libraries which make it easy for you, because once you’ve learned how to use the OpenCL-calls they are completely integrated in your favourite programming language.

Opinions crossing the table: Khronos for world peace

Posted by Vincent Hindriksen on 7 December 2016

At SC16 there was a discussion between programming language standards for heterogeneous hardware, organised by Khronos. See here for the setup of the session. It was expected to be a heated discussion, but in the end it was a good conversation with lost of learning.

The main message from each language seems to be: “Yes, we’re working on that feature”. This means that a programming language is just like human languages, as new things get named and described world-wide. This also shows the hard work the development of languages bring, as new feature-requests are a constant. Continue reading “Opinions crossing the table: Khronos for world peace” →

Brochures

We have two brochures. One general and one for training. You can also generate PDFs from most pages on this website to support off-line discussion.

New versions will arrive soon – unfortunately there have been quite some delays.

[columns]
[one_half title=”Training”]

In the brochure for trainings (version January 2012) all training modules are written out.

Download our trainings brochure here.

[/one_half]
[one_half title=”General”]

In the general brochure (version May 2011) we explain both our consultancy and training services.

Download our general brochure here.

[/one_half]
[/columns]

StreamHPC.Brochure.2010-01.EN.pdf

Performance of 5 accelerators in 4 images

Posted by Vincent Hindriksen on 8 December 2015

running If there would be one rule to get the best performance, then it’s avoiding data-transfers. Therefore it’s important to have lots of bandwidth and GFLOPS per processor, and not simply add up those numbers. Everybody who has worked with MPI, knows why: transferring data between processors can totally kill the performance. So the more is packed in one chip, the better the results.

In this short article, I would like to quickly give you an overview of the current state for bandwidth and performance. You would think the current generation accelerators is very close, but actually it is not.

The devices in the below images are AMD FirePro S9150 (16GB), NVidia Tesla K80 (1 GPU of the 2, 12GB), NVidia Tesla K40 (12GB), Intel XeonPhi 7120P (16GB) and Intel Xeon 2699 v3 (18 core CPU). I doubted about selecting a K40 or K80, as I wanted to focus on a single GPU only – so I took both. Dual-GPU cards have an advantage when it comes to power-consumption and physical space – both are not taken into consideration in this blog. Neither efficiency (actual performance compared to theoretical maximum) is included, as this also needs a broad explanation.

Each of these accelerators runs on X86-OpenMP and OpenCL

The numbers

The bandwidth and performance show where things stand: The XeonPhi and FirePro have the most bandwidth, and the FirePro is a staggering 70% to 100% faster than the rest on double precision GFLOPS.

bandwidth-per-chip — Xeon Phi gets to 350 GB/s, followed by the FirePro with 320 GB/s and K40 with 288 GB/s. NVidia’s K80 is only as 240 GB/s, where DDR gets only 50 -60 GB/s.

gflops-per-chip — The FirePro leaves the competition far behind with 2530 GFLOPS (Double Precision). The K40 and K80 get 1430 and 1450, followed by the CPU at 1324 and the Xeon Phi at 1208. Notice these are theoretical maximums and will be lower in real-world applications.

If you have OpenCL or OpenMP code, you can optimise your code for a new device in a short time. Yes, you should have written it in OpenCL or openMP, as now the competition can easily outperform you by selecting a better device.

Costs

Lowest prices in the Netherlands, at the moment of writing:

Intel Xeon 2699 v3: € 6,560.
Intel Xeon Phi 7120P + 16GB DDR4: € 3,350
NVidia Tesla K80: € 5,500 (€ 2,750 per GPU)
NVidia Tesla K40: € 4,070
AMD FirePro S9150: € 3,500

Some prices (like the K40) have one shop with a low price, where others are at least €200 more expensive.

Note: as the Xeon can have 1TB of memory, the “costs per GB/s” is only half the story. Currently the accelerators only have 16GB. Soon a 32GB FirePro will be available in the shops, the S9170, to move up in this space of memory hungry HPC applications.

costs-per-gflops-per-chip — For raw GFLOPS the FirePro is the cheapest, followed by the K80, XeonPhi and then the K40. While the XeonPhi and K40 are twice as expensive as the FirePro, the Xeon is clearly the most expensive as it is 3.5 times as expensive as the FirePro.

If costs are an issue, then it really makes sense to invest some time in making your own Excel sheets for several devices and include costs for power usage.

Which to choose?

Based on the above numbers, the FirePro is the best choice. But your algorithm might simply work better on one of the others – we can help you by optimising your code and performing meaningful benchmarks.

OpenCL potentials: Watermarked media for content-protection

Posted by Vincent Hindriksen on 23 November 2011

HTML5 has the future, now Flash and Silverlight are abandoning the market to make the way free for HTML5-video. There is one big problem and that is that it is hard to protect the content – before you know the movie is on the free market. DRM is only a temporary solution and many times ends in user-frustration who just want to see the movie wherever they want.

If you look at e-books, you see a much better way to make sure PDFs don’t get all over the web: personalizing. With images and videos this could be done too. The example here at the right has a very obvious, clearly visible watermark (source), but there are many methods which are not easy to see – and thus easier to miss by people who want to have needs to clean the file. It therefore has a clear advantage over DRM, where it is obvious what has to be removed. Watermarks give the buyers freedom of use. The only disadvantage is that personalised video’s ownership cannot be transferred.

Continue reading “OpenCL potentials: Watermarked media for content-protection” →

OpenCL Basics: Running multiple kernels in OpenCL

Posted by Vincent Hindriksen on 15 October 2018

This series “Basic concepts” is based on GPGPU-questions we get via email more than once, or when the question is not clearly explained in the books. For one it is obvious, for the other just what they’re missing.

They say that learning a new technique is best done by playing around with working code and then try to combine it. The idea is that when you have Stackoverflowed and Githubed code together, you’ve created so many bugs by design that you’ll learn a lot if you make it work. When applying this to OpenCL, you quickly get to a situation that you want to run one.cl file and then another.cl file. Almost all beginner’s material discuss a single OpenCL-file, so how to do this elegantly?

Continue reading “OpenCL Basics: Running multiple kernels in OpenCL” →

Our Onboarding Process

We currently have 6 months for onboarding. This helps you, as our new colleague, get used to the different environment, the company’s priorities and ways of working.

What requirements we signal before hiring

The self-assessment contains all the elements we try to foster. It shows 4 main areas:

CPU programming & algorithms
GPU programming & performance oriented programming
Problem solving
Collaboration, planning & prioritization

Each of these, and some personal skills, we discuss, improve and measure. Six months is never enough to improve on everything, but it is enough to focus on the items where most need to be done. It’s certainly enough time to understand what quality means and how to get there.

It’s easy to get confused about quality. Quality might mean deluxness, fanciness, expensiveness. That’s not what quality means. Quality means meeting spec. Keeping the promise. Doing exactly what you said you would do. It is measured in consistency.
Seth Godin

First week(s)

Getting started means you get everything on your desk at once. Best to have a checklist:

Creation of accounts
Ask questions you might have to your buddy.
Introduction to others.
Learning the basics of the development and collaboration environments
Get necessary hardware, including noise-canceling headphones.
Make technical contributions from the first day, as this provides:
- Measurable contributions
- Time management

What to expect in your first months

In your first weeks, you’ll ease into a project, pair up with a buddy, meet teammates (both in-office and remote), and get familiar with how we work, daily standups, time tracking, and internal tools like GitLab. You’ll also learn about our team culture, performance expectations, and key documents that shape how we collaborate. Regular check-ins at one, and five months help track your growth, gather feedback, and support your long-term development.

OpenCL SPIR by example

Posted by Vincent Hindriksen on 27 December 2013 with 1 Comment

OpenCL SPIR (Standard Portable Intermediate Representation) is an intermediate representation for OpenCL-code, comparable to LLVM IL and HSAIL. It is a search for what would be a good representation, such that parallel software runs well on all kinds of accelerators. LLVM IL is too general, but SPIR is a subset of it. I’ll discuss HSAIL, on where it differs from SPIR – I thought SPIR was a better way to start introducing these. In my next article I’d like to give you an overview of the whole ecosphere around OpenCL (including SPIR and HSAIL), to give you an understanding what it all means and where we’re going to, and why.

Know that the new SPIR-V is something completely different in implementation, and we are only discussing the old SPIR here.

Contributors for the SPIR specifications are: Intel, AMD, Altera, ARM, Apple, Broadcom, Codeplay, Nvidia, Qualcomm and Xilinx. Boaz Ouriel of Intel is the pen-holder of the specifications and to no surprise Intel has had the first SPIR-compiler. I am happy to see Nvidia is in the committee too, and hope they don’t just take ideas for CUDA from this collaboration but finally join. Broadcom and Xilinx are new, so we can expect stuff from them.

For now, just see what SPIR is – as it can help us understand how the compiler work and write better OpenCL code. I used Intel’s offline OpenCL compiler for compiling the below kernel to SPIR can be done on the command line with: ioc64 -cmd=build -input=sum.cl -llvm-spir32=sum.ll (you need an Intel CPU to use the compiler).

[raw]

__kernel void sum(const int size, __global float * vec1, __global float * vec2){
  int ii = get_global_id(0);

  if(ii < size) vec2[ii] += vec1[ii];

}

[/raw]

There are two variations for generating SPIR-code: binary SPIR, LLVM-SPIR (both in 32 and 64 bit versions). As you might understand, the binary form is not really readable, but SPIR described in the LLVM IL language luckily is. Run ioc64 without parameters to see more options (Assembly, pure LLVM, Intermediate Binary).

Continue reading →

Supporting OpenCL on your own hardware

Posted by Vincent Hindriksen on 2 April 2012

Say you have a device which is extremely good in numerical trigoniometrics (including integrals, transformations, etc to support mainly Fourier transforms) by using massive parallelism. You also have an optimised library which takes care of the transfer to the device and the handling of trigoniometric math.

Then you find out that the strength of your company is not the device alone, but also the powerful and easy-to-use library. You also find out that companies are willing to pay for the library, if it would work with other devices too. From your own helpdesk you hear that most questions are about extending the library with specialised functions. Giving this information, you define new customer groups for device-only and library-only – so just by adopting a standard you can increase revenue. Read below which steps you have to take to adopt OpenCL.

Continue reading “Supporting OpenCL on your own hardware” →

Basic concepts: malloc in the kernel

Posted by Vincent Hindriksen on 30 November 2013

22489954_ml — Pointers and allocated memory space with a hint to Oktoberfest.

During the last training I got a question how to do malloc in the kernel. It was one of those good questions, as it gives another view on a basic concept of OpenCL. Simply put: you cannot allocate (local or global) memory from within the kernel. Luckily it’s possible, but it is somewhat hidden in another function.

clSetKernelArg to the rescue

The way to do it is from the host, using one of the kernel arguments.

cl_int clSetKernelArg ( cl_kernel kernel,

cl_uint arg_index,

size_t arg_size,

const void *arg_value)

This function allocates the memory on the device for you. Just as with normal malloc, it doesn’t clear the memory for you.

To make sure the host cannot access it (and you don’t accidentally pin/write/read it, when using host-generation scripts), you can use a flag for that: CL_MEM_HOST_NO_ACCESS. All the flags have been explained in a previous article about this same function, setting flags for creating kernel arguments.

The advantage of only allowing malloc to be done from the host, before the kernel is launched, is that the memory-planning can be done more efficiently.

Local memories

When you need a local space, you can specify that at the kernel-side. For example:

__kernel void foo(__local int* bar) { ... }

This mallocs an area in all local memories with size specified by arg_size.

Basic Concepts

This short article is in the basic concept series. It contains several subjects I did not see well-enough explained in books or the reference manual. If you see a subject that you would like to see in this series, just contact us.

Visit us (Amsterdam)

So we invited you over? Cool! See you soon!

The Amsterdam Stream HPC offices are located on the sixth floor of Koningin Wilhelminaplein 1 in Amsterdam, which is at the Amsterdam West Poort business area. Below you’ll find information on how to get there.

Getting to Koningin Wilhelminaplein 1

By Car

The office is located near the ring road A10, which makes the location easily accessible by car, via exit S107.

From the ring road A10 the complete Dutch motorway network is accessible. Taking the A10 to the South often results in a traffic jam though. See https://www.anwb.nl/verkeer for up-to-date traffic info.

Parking in parking garage is only available when you let us know in advance! There is a ParkBee at a 5 minutes walking distance – always more than enough place. Costs max €10 per day when using the Yellowbrick app or reserved via Parkbee, and about €20 per day when paid at location. Please get clarity on who pays this, in advance.

Route	Travel time (outside rush hours)
Office – Schiphol	15 minutes
Office – The Hague	40 minutes
Office – Utrecht	35 minutes
Office – Rotterdam	50 minutes

Travel time (outside rush hours)

By Public transport

The office is a 5 minute walk from Amsterdam Lelylaan. See further below for the walking route.

*View in the direction of the office from the metro station*

In Amsterdam the Lelylaan station is a medium sized public transport hub. It should be easy to get from any big city or any address in Amsterdam to here, as many fast trains also stop here.

Trains to the North: Amsterdam Central, Haarlem, North and East of the Netherlands
Trains to the South: Schiphol, Amsterdam Zuid, Amsterdam RAI, Utrecht, Eindhoven, Leiden and Rotterdam
Bus: Lines 62 (Amstel), 63 (Osdorp), 195 (Schiphol).
Metro: Line 50 connecting to Amsterdam train-stations Sloterdijk, Zuid, RAI and Bullewijk. In case there are problems with the train to Lelylaan/Sloterdijk, one option is to go to Amsterdam Zuid and take the metro from there. Line 51 connects to Vrije University in Amsterdam Zuid.
Tram: Lines 1 (Osdorp – Muiderpoort) and 17 (Osdorp – Central station).
Subway: Line 50, also connecting to Amsterdam train-stations Lelylaan, Zuid, RAI and Bullewijk. In case there are problems with the train to Schiphol, go to Amsterdam Zuid and take the train from there.

See https://9292.nl/station-amsterdam-lelylaan for all time tables and planning trips.

Walking from the train/metro station

Remember that in the Netherlands crossing car lanes is relatively safer than crossing biking lanes, contrary to traffic in other countries. In Dutch cities, cars break when you cross the street, while bikes simply don’t. No joke. So be sure not to walk on the red biking roads unless really necessary.

When leaving the Train station, make sure you get to the Schipluidenlaan-exit towards the South (to the right, when you see the view as on the image). This is where the buses are, not the trams. If you are at the trams area (between two car roads), go back to the station area.

When near the bus-stop, go to the roundabout to the West. Walk the whole street to the next roundabout, where you see the shiny office-building at your right side.

By Taxi

In Amsterdam you can order a taxi via +31-20-6777777 (+31-206, 6 times 7). Expect a minimum charge of €20.

At Schiphol Airport there are official taxi stands – it’ll take 15-25 minutes to get to Lelylaan outside rush hours. Make sure to tell about the roundabout-reconstruction to prevent a 10-minute longer drive.

Bicycle

For biking use https://www.route.nl/routeplanner and use “Rembrandtpark” as the end-point for the better/nicer/faster routes. From the park it’s very quick to get to the office – use a normal maps app to get to the final destination.

Inside

When entering the front door, go to the right to find the elevators. There go to the 6th floor. From the elevators, go to the North. You’ll see our sign!

*Stream HPC has the blue marked office.*

IWOCL 2017 Toronto call for talks and posters is open

Posted by Vincent Hindriksen on 30 December 2016

The fifth International Workshop on OpenCL (IWOCL) will be held on 16-18 May 2017 in Toronto, Canada. The event kicks-off with a full-day Advanced Hands-On OpenCL tutorial which is followed by two-days of conference: keynotes, academic papers, technical presentations, tutorials, poster sessions and table-top demonstrations.

IWOCL 2017 Call for Submission Now Open – Submit your abstract here. Deadline is beginning of February, so better submit the coming month!

Call for IWOCL 2017 Annual Sponsors is also open. For that contact the IWOCL organisation via this webform.

Every year there have been unique conversations having real influence on the OpenCL standard, and we heard real-life development experience during various talks. If you missed the real technical talks at certain other GPU conferences, then IWOCL is where you should go.

What does Khronos has more to offer than OpenCL and OpenGL?

Posted by Vincent Hindriksen on 24 November 2014

The OpenCL standard is from the not-for-profit industry consortium Khronos Group. But they do a lot more, like the famous standard OpenGL for graphics. Focus of the group has always been on multimedia and getting the fastest results out of the hardware.

Now open source and open standards are getting more important, collabroations like the Khronos Group, get more attention. At StreamHPC we are very happy with this trend, as the business models are more focused on collaborations and getting things done than on making sure the customer cannot ever leave.

Below is an overview of the most important APIs that Khronos has to offer.

OpenCL related

OpenCL: compute
WebCL: web compute
SPIR/SPIR-V: intermedia language for compute-kernels, like those of OpenCL and OpenGL’s GSLS
SYCL: high-level language for OpenCL

OpenGL related

Vulkan: state-less graphics
OpenGL: graphics
OpenGL ES: embedded graphics
WebGL: web graphics
glTF: runtime asset format for WebGL, OpenGL ES, and OpenGL
OpenGL SC: Graphics for Safety Critical operations
EGL: interface between rendering APIs such as OpenGL ES and the underlying native platform window system, such as X.

Streaming input and output

OpenMAX: interface for multimedia codecs, platforms and hardware
StreamInput: interface for sensors
OpenVX: OpenCV-alternative, built for performance.
OpenKCam: interface for cameras and sensors

Others

COLLADA: 3D model file format
OpenSL ES: embedded audio
OpenVG: vector Graphics

One video called “OpenRoad” to show them all:

http://www.youtube.com/watch?v=ckD0op6OgMQ

Want to learn more? Feel free to ask in the comments, or check out https://www.khronos.org/