GPGPU-day materials – teaser

Just a quick teaser. More materials (photos, sheets, videos) are coming soon.

Don’t forget to subscribe to the mailing-list of Platform Parallel Netherlands to hear about more events around parallel programming in the Netherlands.

Click on the icon at bottom-right to watch the video full-screen.

If you took photos during the day, please send them.

Music by Professor Kliq.

Below is the short version with photos only.

StreamComputing is 2 years old! A personal story.

More than two years ago, on 13 January 2010, I wrote my first blog-post. Four months later StreamComputing (redacted: rebranded to StreamHPC in 2017) was both official and unknown. I want to share with you my personal story of how I came to start this company.

The push-factor

I wanted to create a company which was about innovative projects – something I had hardly encountered until then. In the years before, I programmed parts of A-to-B-flows, as I call them. That is software that is quite simple at its core, but is tediously discussed as very, very complex.

“Complex” software

The complexity is not the software, as you can see. It is undocumented APIs, forgotten knowledge, knowledge in the heads of unknown people, bossy and demanding people who ask in a friendly way for last-minute architecture changes, deadlines around promotion-rounds, new deadlines due to board-decisions, people being afraid of getting replaced if the software is finished, jealousy if another team makes version 2 of the software, etc. The rule of office-software is therefore understandable:

Software is either unfinished,
or turned into a platform for unintended functionality.

The fun in office-software is there for the analyst, architect or manager – the developer just puts in his earphones and makes all the requested changes (hooray for services like Spotify). But as I did not want to become a manager and wished to keep improving my development skills, I had to conclude I was on the wrong track.

Continue reading “StreamComputing is 2 years old! A personal story.”

AMD gDEBugger 6.2 for Linux

The printf-function in kernels isn’t the solution to everything, hence the profilers and debuggers specially tailored for GPU-programming. On Windows there is a lot of choice, but mostly only if you have a paid version of Visual Studio. On Linux you have GDB, but that program is not really user-friendly for GUI-lovers.

For AMD, gDEBugger is now available for Linux again. Again, because version 5.8 by Gremedy worked on Linux, but after AMD bought the company version 6 became Windows-only. A few weeks ago, 10 months after 6.0, Linux-binaries returned with version 6.2. It supports OpenCL 1.2, OpenGL 3.2 and quite a few extensions. As only AMD is supported, more on debugging OpenCL-applications on NVidia and Intel will follow later.

Installation is quite straightforward. For creating a menu-item, you’ll find a useful image in /opt/gDEBugger6.2.xxx/tutorial/images/.

Continue reading “AMD gDEBugger 6.2 for Linux”

NVIDIA: mobile phones, tablets and HPC (cloud)

If you want to see what is coming up in the market of consumer-technology (PC, mobile and tablet), then NVIDIA can tell you the most. The company is very flexible, and shows time after time that it really knows in which markets it currently operates and which it can enter. I sometimes strongly disagree with their marketing, but watch them closely as they are in the most important markets that define the near future: PCs, Mobile/Tablet and HPC.
You might think I completely miss interconnects (buses between processors, devices and memory) and memory-technologies, as clouds have a large need for high-speed data-transport, but the last 20 years have shown that this is a quite steadily developing market based on IP-selling to the hardware-vendors. With the acquisition of Cray’s interconnect technology, we have seen this is serious business for Intel, so things might indeed change. For this article I want to focus on NVIDIA’s choices.

Neil Trevett on OpenCL

The Khronos Group gave some talks on their technologies in Shanghai, China on the 17th of March 2012. Neil Trevett made some interesting remarks on the position of NVidia on OpenCL that I would like to share with you. Neil Trevett is both an important member of Khronos and an employee of NVidia. To be more precise, he is the Vice President Mobile Content of NVidia and the president of Khronos. I think we can take his comments seriously, but we must be very careful as these are mixed with his personal opinions.

Regular readers of the blog have seen I am not enthusiastic at all about NVidia’s marketing, but am a big fan of their hardware. And I am very positive that they are bold enough in the industry to position themselves very well in the fast-changing markets of the upcoming years. Having said that, let’s go to the quotes.

All quotes are from this video. It is best to watch from 41:50 to 45:35.

At 44:05 he states: “In the mobile space I think CUDA is unlikely to be widely adopted”, and explains: “A party API in the mobile industry doesn’t really meet market needs”. He then continues with his vision on OpenCL: “I think OpenCL in the mobile is going to be fundamental to bring parallel computation to mobile devices” and then “and into the web through WebCL”.

Also interesting at 44:55: “In the end NVidia doesn’t really mind which API is used, CUDA or OpenCL. As long as you get to use great GPUs”. He ends with a smile, as “great GPUs” refers to NVidia’s of course. 🙂

At 45:10 he lays out NVidia’s plans for HPC, before getting back to mobile: “NVidia is going to support both [CUDA and OpenCL] in HPC. In Mobile it’s going to be all OpenCL”.

At 45:23 he repeats his statements: “In the mobile space I expect OpenCL to be the primary tool”.

Continue reading “Neil Trevett on OpenCL”

USB-stick sized ARM-computers

Now that smartphones are getting more powerful and the internet makes it possible to have all functionality and documents with you anywhere, the computer needs to be reinvented. You see all big IT-companies searching for what that could look like, from Windows Metro to complete docking stations that replace the desktop with your phone. A turbulent market.

One of the new products is the USB-stick sized computer. Stick it into a TV or monitor, enter your code and you have your personal working environment. You never need to carry a laptop to your hotel-room or conference, as long as a screen is available – any screen.

There are several USB-computers entering the market, but I want to introduce you to two. Both of these see a future in a strong processor in a portable device, and neither has a real product with such a strong processor yet. But you can expect that in 2013 you can have a device that can do very fast parallel processing to have a smooth Photoshop experience… on your key-ring.

Continue reading “USB-stick sized ARM-computers”

PDFs of Monday 16 April

By exception, another PDF-Monday.

OpenCL vs. OpenMP: A Programmability Debate. One moment OpenCL produces the faster code, the next moment OpenMP does. From the conclusion: “OpenMP is more productive, while OpenCL is portable for a larger class of devices. Performance-wise, we have found a large variety of ratios between the two solutions, depending on the application, dataset sizes, compilers, and architectures.”

Improving Performance of OpenCL on CPUs. Focusing on how to optimise OpenCL. From the abstract: “First, we present a static analysis and an accompanying optimization to exclude code regions from control-flow to data-flow conversion, which is the commonly used technique to leverage vector instruction sets. Second, we present a novel technique to implement barrier synchronization.”

Variants of Mersenne Twister Suitable for Graphic Processors. Source-code at http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MTGP/

Accelerating the FDTD method using SSE and GPUs. “The Finite-Difference Time-Domain (FDTD) method is a computational technique for modelling the behaviour of electromagnetic waves in 3D space”. This is a project-plan, but it describes the theory pretty well.

Continue reading “PDFs of Monday 16 April”

5 types of loops you should avoid

In “Separation of compute, control and transfer” I talked about node-wise programming as a method we should embrace instead of trying to unroll the existing loops. In this article I get into loops and discuss a few types and how they can be run in a parallel form. Dependency is the big variable in each type: the lower the dependency on previous iterations, the better it can be parallelised. Another variable is whether the iteration-dimensions are known before the loop is started.

The more you think about it, the more you find that a loop is not a loop.
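As a minimal sketch of the dependency distinction described above (Python is used purely for illustration; the function names are hypothetical and not taken from the article), compare a loop whose iterations are independent with one where each iteration needs the result of the previous one:

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Work done for one element; it reads no other iteration's result."""
    return x * x


def independent_loop(data):
    # No iteration depends on another, so the work can be handed out in any
    # order. pool.map still returns the results in input order.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(square, data))


def dependent_loop(data):
    # Running total: iteration i needs the total of iteration i-1, so this
    # form cannot simply be split up per element.
    out = []
    total = 0
    for x in data:
        total += x
        out.append(total)
    return out


if __name__ == "__main__":
    print(independent_loop([1, 2, 3, 4]))
    print(dependent_loop([1, 2, 3, 4]))
```

The thread pool only stands in for "can be run in any order"; it is not meant as a performance recipe. The second loop is the kind that needs a different algorithm before it can be parallelised.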

Continue reading “5 types of loops you should avoid”

Supporting OpenCL on your own hardware

Say you have a device which is extremely good at numerical trigonometry (including integrals, transformations, etc., mainly to support Fourier transforms) by using massive parallelism. You also have an optimised library which takes care of the transfer to the device and the handling of the trigonometric math.

Then you find out that the strength of your company is not the device alone, but also the powerful and easy-to-use library. You also find out that companies are willing to pay for the library, if it worked with other devices too. From your own helpdesk you hear that most questions are about extending the library with specialised functions. Given this information, you define new customer groups for device-only and library-only – so just by adopting a standard you can increase revenue. Read below which steps you have to take to adopt OpenCL.

Continue reading “Supporting OpenCL on your own hardware”

Separation of Compute and Transfer from the rest of the code.

What if a tree had its roots, trunk and crown mixed up? Would it still have the advantage over other plants?

At the beginning of 2012 I spoke with Patrick Viry, former CEO of Ateji – now out of business. We shared ideas on GPGPU, OpenCL and programming in general. While talking about the strengths of his product, he made a remark which I found important and interesting: separation of transfer. This triggered me to think further – those were the times when you could not read up on modern computing, but had to define it yourself.

Separation of focus-areas is known to increase effectiveness, but is said to be for experts only. I disagree completely – the big languages just don’t have good support for defining the separation of concerns.

For example, the concept of loops is well-known to all programmers, but OpenCL and CUDA have broken with that. Instead of using huge loops, those languages describe what has to be done at one location in the data and which data is to be processed. From what I see, this new type of loop is being abandoned in higher-level languages, even though it is a good design pattern.
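To make the contrast concrete, here is a minimal sketch in Python (not OpenCL or CUDA code; `scale_kernel`, `launch` and `global_id` are hypothetical names chosen for illustration). The first function is the classic loop, the second describes only what happens at one location in the data, and a small stand-in "runtime" decides which locations to process:

```python
def scale_loop(values, factor):
    """Classic form: the loop and the per-element work are written together."""
    result = [0.0] * len(values)
    for i in range(len(values)):
        result[i] = values[i] * factor
    return result


def scale_kernel(global_id, values, factor, result):
    """Kernel form: only the work for the single location `global_id`."""
    result[global_id] = values[global_id] * factor


def launch(kernel, global_size, *args):
    # Stand-in for a runtime. It runs the locations one by one here; a real
    # OpenCL or CUDA runtime would run them concurrently and in no
    # guaranteed order.
    for global_id in range(global_size):
        kernel(global_id, *args)


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0]
    out = [0.0] * len(values)
    launch(scale_kernel, len(values), values, 2.0, out)
    print(out == scale_loop(values, 2.0))
```

Because the kernel never mentions the loop, the range of data to process becomes a separate decision, which is the separation this paragraph argues for.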

I would like to discuss separation of compute and transfer from the rest of the code, to show that this will improve the quality of code.

Continue reading “Separation of Compute and Transfer from the rest of the code.”

StreamHPC flirts with ARM

With the launch of the Twitter-channel @OpenCLonARM we now officially show a strong interest in ARM for compute. And we are not the only ones, as the channel already has 80 followers (60 in 1.5 days and 12 retweets of the welcome-message).

ARM has made tremendous progress in both technology and market-share. With ARM-64 and companies like NVidia (and maybe AMD) in the field, X86 seems to be getting a real competitor. This could happen because for a few years now computers have been fast enough, and they are not being replaced by a faster one, but by a smaller one (tablet, phone) or supplemented by an extra one. By the rules of the market, current technologies are replaced by the ones that meet those other needs. ARM is fast (enough), flexible in design, very cheap, low-power and passively cooled. The biggest remaining obstacle seems to be a standard for a docking-station to connect your mobile, tablet or watch to a keyboard, mouse and large screen.

OpenCL is perfect for ARM, as it provides the computation-power for the intensive computations not already covered by hardware-support. In the world of X86 this interests high-performance and big-data companies, whereas on ARM it interests a wider audience as well. Without OpenCL you can already watch HD video; with OpenCL you can also encode video to MP4. This year you will certainly hear more about new possibilities of OpenCL on ARM.

What do you think? Why does Intel not sell IP to ARM-companies, as many technologies could be reused? Could Intel be the next ARM as an IP-seller, or will they stay the defender of X86 for many years to come?

streamhpc.com is not affiliated with ARM.

AccelerEyes ArrayFire

There is a lot going on along the path to GPGPU 2.0 – the libraries on top of OpenCL and/or CUDA. Among many solutions we see for example Microsoft with C++ AMP on top of DirectCompute, NVidia (and more) with OpenACC, and now AccelerEyes (best known for their Matlab-extension Jacket and libJacket) with ArrayFire.

I want to show you how easy programming GPUs can be when using such libraries – know that for using all features such as complex numbers, multi-GPU and linear algebra functions, you need to buy the full version. Prices start at $2500,- for a workstation/server with 2 GPUs.

It comes in two flavours: for OpenCL (C++) and for CUDA (C, C++, Fortran). The code for both is the same, so you can easily switch – though you still see references to cuda.h, you can compile most examples from the CUDA-version using the OpenCL-version with little editing. Let’s look a little into what it can do.

Continue reading “AccelerEyes ArrayFire”

Theoretical transfer speeds visualised

There are two overviews I use during my training that I would like to share with you. Normally I write them on a whiteboard, but having them in a digital form has its advantages.

Transfer speeds per bus

The image below gives an idea of theoretical transfer speeds, so you know how a fast network (1GB of data in 10 seconds) compares to GPU-memory (1GB of data in 0.01 seconds). It does not show all the ins and outs, but just gives an idea of how things compare. For instance, it does not show that many cores on a GPU need to work together to get that maximum transfer rate. Also, I have not used very precise benchmark-methods to come to these views.
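The two figures quoted above can be turned into bandwidths with simple arithmetic. The sketch below uses only those two numbers from the text; the function and constant names are hypothetical, and the results are rough theoretical figures, not measurements:

```python
def transfer_seconds(size_gb, bandwidth_gb_per_s):
    """Theoretical time to move `size_gb` gigabytes over a bus."""
    return size_gb / bandwidth_gb_per_s


# Bandwidths derived from the figures in the text:
# 1 GB in 10 seconds and 1 GB in 0.01 seconds.
FAST_NETWORK_GB_PER_S = 1.0 / 10.0
GPU_MEMORY_GB_PER_S = 1.0 / 0.01

if __name__ == "__main__":
    network = transfer_seconds(1.0, FAST_NETWORK_GB_PER_S)
    gpu = transfer_seconds(1.0, GPU_MEMORY_GB_PER_S)
    print(f"network: {network:.2f} s, GPU memory: {gpu:.2f} s")
    print(f"ratio: roughly {network / gpu:.0f}x")
```

With these numbers the same gigabyte takes about a thousand times longer over the network than within GPU-memory, which is why the choice of bus matters so much.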

We zoom into the slower bus-speeds. So all the good stuff is on the left and all buses to avoid are on the right. What should be clear is that a read from or write to an SSD will make the software very slow if you use write-through instead of write-back.

What is important to see is that localisation of data makes a big difference. Take a look at the image and then try to follow along with me. When using GPUs, the following can all increase the speed on the same hardware: not using hard-disks in the computation-queue, avoiding transfers to and from the GPU, and increasing the computations per byte of data. When an algorithm needs to do a lot of data-operations such as transposing a matrix, then it’s better to have a GPU that has high memory-access speed. When the number of operations is important, then clock-speed and cache-speed are most important.

Continue reading “Theoretical transfer speeds visualised”