General articles on technical subjects.

12-14 June: OpenCL Training Amsterdam

Posted by Vincent Hindriksen on 25 April 2013

From 12 to 14 June StreamHPC will give a 3-day course in OpenCL (was 3 to 5 June). Here you will learn how to develop OpenCL-programs.

A separate ticket for only the first day can be bought, as that day will be a crash-course in OpenCL. Module: basics.

The second and third day will be all about parallel-algorithm design, optimisation and error-handling. Module: optimisation, with several new subjects added.

The last part of the third day is reserved for special subjects, as requested by the attendees. Continue reading “12-14 June: OpenCL Training Amsterdam” →

Scaling mobile GPUs to 1000 GFLOPS

Posted by Vincent Hindriksen on 22 April 2013

On the 20th of April 2013 there was an interesting discussion between Jan Gray and David Kanter. Jan is a specialist in C++ and FPGAs (twitter, homepage). David is a specialist in CPU and GPU architectures (twitter, homepage). Both know their way around the field of semiconductors. It is always a joy to follow their short discussions when they happen, but there was something about this one that made me want to share it with special attention.

OpenCL on ARM: Growth-expectation of GFLOPS/Watt of mobile GPUs exceeds Moore’s law. That’s incredible!

Jan Gray: .@OpenCLonARM GFLOPS/W more a factor of almost-over Dennard Scaling. But plenty of waste still to quash. http://www.fpgacpu.org/papers/Gray_AutumnOfMooresLaw_SingularityUniversity_11-06-23.pdf …

Jan Gray: .@openclonarm Scratch Dennard tweet: reduced capacitance of yet smaller devices shd improve GFLOPS/W even as we approach end of Vdd scaling.

David Kanter: @jangray @OpenCLonARM I think some companies would argue Vdd scaling isn’t dead…

Jan Gray: @TheKanter @openclonarm it’s not dead, but slowing, we’ve gone from 5V to 1V (25x power savings) and have maybe several hundred mVs to go.

David Kanter: @jangray I reckon we have at least 400mV, so ~2X; slower than ideal, but still significant

Jan Gray: @TheKanter We agree, I think.

David Kanter: @jangray I suspect that if GPU scaling > Moore’s Law then they are just spending more area or power; like discrete GPUs in the last decade

David Kanter: @jangray also, most positive comment I’ve heard from industry folks on mobile GPU software and drivers is “catastrophically terrible”

Jan Gray: @TheKanter Many ways to reduce power, soup to nuts. For ex HMC DRAM on interposer for lower energy signaling. I’m sure many tricks to come.

In a nutshell, these are all the reasons they think mobile GPUs can outpace Moore’s law while staying under a certain power-usage.
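The “25x power savings” in the discussion can be reproduced with a first-order model of dynamic switching power. The tweets do not spell the model out, so take the formula below as an assumption; the symbols (activity factor, switched capacitance, clock frequency) are the usual textbook ones and are held constant here.

```latex
P_{\text{dyn}} \approx \alpha \, C \, V_{dd}^{2} \, f
\qquad\Longrightarrow\qquad
\frac{P_{\text{dyn}}(5\,\text{V})}{P_{\text{dyn}}(1\,\text{V})}
  = \left(\frac{5\,\text{V}}{1\,\text{V}}\right)^{2} = 25
```

In this model a lower capacitance C reduces power as well, which is the point of the “reduced capacitance of yet smaller devices” tweet.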

The discussion needs some background-info, so let’s start with the background of the first tweet, and then explain what has been said. Continue reading “Scaling mobile GPUs to 1000 GFLOPS” →

Q&A with Adrien Plagnol and Frédéric Langlade-Bellone on WebCL

Posted by Vincent Hindriksen on 3 April 2013 with 2 Comments

WebCL is a great technique to have compute-power in the browser. After WebGL, which gives high-end graphics in the browser, this is a logical step on the road towards the browser-only operating system (like Chrome OS, but more will follow).

Another way to look at technologies like WebCL is that they make it possible to lift the standard base from the OS to the browser. If you remember the trial over Microsoft’s integration of Internet Explorer, the focus was on the OS needing the browser to work well. Now it is the other way around, but it can be any OS. This is because the push doesn’t come from below, but from above.

Last year two guys from Lyon (South-France) got quite a lot of attention, as they wrote a WebCL-plugin. Their names: Adrien Plagnol and Frédéric Langlade-Bellone. Below you’ll find a Q&A with them on WebCL. Enjoy! Continue reading “Q&A with Adrien Plagnol and Frédéric Langlade-Bellone on WebCL” →

LEAP-conference call for papers

Posted by Vincent Hindriksen on 5 February 2013

Building bridges in a new industry

Embedded processors have always had the focus on low energy. Now a combination of Moore’s law, the frequency-wall and multi-processor developments has made it possible for these processors to compete in completely new market segments, most notably due to impressive advancements in graphics IP.
We are now looking at four groups who are interested in learning from each other:

The embedded processor market
The FPGA market
The HPC and server market
The GPGPU market

And answer the question: how can we get more out of low-energy processors by looking at other industries?

The goal of the LEAP conference is to bring these four groups together, creating windows to each other and paving roads over the newly constructed bridges. This makes it one of a kind. Half of the conference is focused on quality information sharing and the other half on networking. For more information, check the website of the LEAP-conference. StreamHPC is a co-organiser.

~~Call for papers is now open!~~ Programme is filled!

Continue reading →

OpenCL Basics: Flags for the creating memory objects

Posted by Vincent Hindriksen on 3 February 2013 with 12 Comments

In OpenCL large memory objects, residing in the main memory of the host or the global memory of the accelerator/GPU, need special treatment. The first reason is that these memories are relatively slow. The second reason is that the mostly serial copying of objects between these two memories takes time.

In this post I’d like to discuss all the flags used when creating memory objects, and what they can do to assist in this special treatment.

This is explained on the clCreateBuffer page of the specifications, but I think it is not really clear. The function clCreateBuffer (and the similar functions for creating images, sub-buffers, etc.) suggests that you create a special OpenCL-object to be given as an argument to the kernel. What actually happens is that space is made available in the main memory of the accelerator and optionally a link with host-memory is made.

The flags are divided over three groups: device access, host access and host pointer flags.
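As a rough illustration of the three groups, the sketch below combines the flags into one bitfield and checks that the combination makes sense. It is not taken from the full article. The flag names and values repeat those in the Khronos `cl.h` header, so the sketch compiles without an OpenCL SDK. `mem_flags`, `at_most_one` and `flags_are_consistent` are hypothetical names made up for this example. The host-access flags assume OpenCL 1.2 or later.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for cl_mem_flags; values repeat those in the Khronos cl.h header. */
typedef uint64_t mem_flags;

/* Group 1: device access, i.e. what the kernel may do with the object. */
#define CL_MEM_READ_WRITE      (1u << 0)
#define CL_MEM_WRITE_ONLY      (1u << 1)
#define CL_MEM_READ_ONLY       (1u << 2)
/* Group 2: host access, i.e. what the host may do (OpenCL 1.2 and later). */
#define CL_MEM_HOST_WRITE_ONLY (1u << 7)
#define CL_MEM_HOST_READ_ONLY  (1u << 8)
#define CL_MEM_HOST_NO_ACCESS  (1u << 9)
/* Group 3: host pointer, i.e. how the object is linked to host-memory. */
#define CL_MEM_USE_HOST_PTR    (1u << 3)
#define CL_MEM_ALLOC_HOST_PTR  (1u << 4)
#define CL_MEM_COPY_HOST_PTR   (1u << 5)

/* Hypothetical helper: true when at most one bit of `group` is set in `flags`. */
static bool at_most_one(mem_flags flags, mem_flags group)
{
    mem_flags chosen = flags & group;
    return (chosen & (chosen - 1)) == 0;   /* zero, or exactly one bit */
}

/* Hypothetical checker: one choice per access group, and USE_HOST_PTR may
   not be combined with ALLOC_HOST_PTR or COPY_HOST_PTR. */
bool flags_are_consistent(mem_flags flags)
{
    const mem_flags device = CL_MEM_READ_WRITE | CL_MEM_WRITE_ONLY | CL_MEM_READ_ONLY;
    const mem_flags host   = CL_MEM_HOST_WRITE_ONLY | CL_MEM_HOST_READ_ONLY
                           | CL_MEM_HOST_NO_ACCESS;
    bool use_ptr       = (flags & CL_MEM_USE_HOST_PTR) != 0;
    bool alloc_or_copy = (flags & (CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR)) != 0;

    return at_most_one(flags, device)
        && at_most_one(flags, host)
        && !(use_ptr && alloc_or_copy);
}
```

For example, `CL_MEM_READ_ONLY | CL_MEM_HOST_WRITE_ONLY | CL_MEM_COPY_HOST_PTR` describes an input buffer that the kernel only reads, the host only writes, and that is initialised from host-memory. In real host-code this combined value is passed as the flags argument of clCreateBuffer.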

Continue reading “OpenCL Basics: Flags for the creating memory objects” →

Low-Energy Application Parallelism (LEAP) conference in London

Posted by Vincent Hindriksen on 23 January 2013

For more information on the program, contact us. For information on sponsoring, contact Tim Lewis of CroftEdge.

Website with more info and ticket-sales will open 1 February.

StreamHPC is at HiPEAC, Berlin

Posted by Vincent Hindriksen on 9 January 2013

HiPEAC is the conference on “High-performance and Embedded Architectures and Compilers” to be held on 20 to 22 January in Berlin – tickets are still available, if you can spare three days. I (Vincent Hindriksen) would like to share the program I’m doing.

Edit: I met great people here and hope to see you next year (again).

Continue reading “StreamHPC is at HiPEAC, Berlin” →

X86-workstation buying guide for OpenCL developers, Q1 2013

Posted by Vincent Hindriksen on 6 January 2013 with 7 Comments

Curved iMac has your back…

Nuno Teixeira designed a large curved monitor in 2008 and assumed it would never be made. For a “few” thousand dollars NEC offers one to you right now. Samsung and LG have also announced several new curved TVs at CES 2013 (with HDMI-port). We only need a workstation to go with it, where this blog-article might come in handy.

Important: this article was written before Intel “Haswell” and AMD “Richland” architectures came out.

So you want to start developing for OpenCL? When you focus on developing OpenCL for X86, you have these three options: CPUs, GPUs and CPUs with an embedded GPU. This article is for you and represents the current state of hardware – if you want the best hardware for your specific algorithm, the information below is probably not sufficient.

In 2013 we focus on 4 groups: servers/cloud (FirePro, Tesla, XeonPhi), workstations (discussed here), low-power devices (SoCs) and special accelerators (FPGAs and DSPs). This article does not discuss high-end accelerators of a few thousand euros, which are laid out here.

Before reading on, you need to set the goal for your workstation.

If you want to learn the basics of OpenCL-programming, first check if your current machine has OpenCL-support.
If you need more processing power, be sure you select the right hardware for the job. Don’t buy the most expensive hardware (FirePro, Tesla or XeonPhi), but take your time to find out which hardware supports your algorithms best. Feel free to ask us.
If you want to make sure your software works on various types of accelerators, you can choose between:
- swapping PCIe-cards – disadvantage is the driver hassle and time-consumption.
- more accelerators in one machine – disadvantage is that only GPU 1 can do OpenGL/DirectX.
- identical machines with different accelerators – disadvantage is the price.
If you want to focus on multi-GPU development, you need either:
- a large enough power-supply and a motherboard that supports many lanes, or
- a videocard with two GPUs.

The goal of this article is to help you buy a good machine for OpenCL-development. Prices are as of January 2013. If you think I make the wrong suggestions, please give feedback via the comments.

My contacts at various companies can tell: I want to stay independent no matter what. No deals have been made nor was there any outside influence, except the friendly people of the local computer shops. I was surprised that I ended up suggesting so much AMD hardware, and I felt quite uncomfortable with it – I finally decided to keep to my first conclusions and leave the comments completely open.

Continue reading “X86-workstation buying guide for OpenCL developers, Q1 2013” →

The entanglement of Bitcoins and compute-capabilities

Posted by Vincent Hindriksen on 8 December 2012

Every now and then I read stories on Bitcoins (Wikipedia-article), as GPUs are used a lot to “mine” Bitcoins. They have some extensive benchmarks, and their discussions give me insights into specific parts of accelerators like GPUs. This group is also very forward when it comes to accepting new techniques. Today something changed: they are a bank now. One of the thoughts I had with this, I’d like to share with you.

If you look at various types of currencies, you see they all have various goals (trade, power, resources, energy, properties, etc). The inequality and differences are even more important than the amount. Various currencies are entangled to a certain goal or resource, but there is nothing entangled strongly to technology. Here is where Bitcoins come in…

Bitcoins are entangled with compute-power – a current benchmark for technological progress.

In this article I’d like to share how the tech-economy and Bitcoins are entangled, seen from the perspective of computing. I left out a lot of the “rules of economy” and hope you can put these in – the text below is only meant to guide you through the thought-process. Disagreement is only good – as we all learn from it.

Continue reading “The entanglement of Bitcoins and compute-capabilities” →

The OpenCL power: offloading to the CPU (AVX+SSE)

Posted by Vincent Hindriksen on 28 November 2012 with 2 Comments

Say you have some data that needs to be used as input for a larger kernel, but needs a little preparation to get it aligned in memory (small kernel and random reads). Unluckily the efficiency of such a kernel is very low, so there is no speed-up or even a slowdown. When programming a GPU it is all about trade-offs, but one trade-off is forgotten a lot (especially by CUDA-programmers) once it has been decided to use accelerators: just use the CPU. The main problem is not the kernel that has been optimised for the GPU, but all the supporting code (like the host-code) that needs to be rewritten to be able to use the CPU.

Why use the CPU for vector-computations?

The CPU has support for computing vectors. Each core has a 256-bit wide vector unit. This means a double4 (a vector of 4 times a 64-bit float) can be computed in one clock-cycle. So a 4-core CPU of 3.5GHz goes from 3.5 billion operations per second on a single core to 14 billion when using all 4 cores, and to 56 billion when using vectors. When using a float8, it doubles to 112 billion. Using MAD-instructions (Multiply+Add), this can be doubled again to 224 billion.
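The arithmetic in this paragraph is a simple product, which the following sketch writes out. It assumes, as the text does, that every core issues one full-width vector instruction per clock-cycle, so the result is a theoretical peak and not a measurement. The function name `peak_giga_ops` and its parameters are made up for this illustration.

```c
#include <assert.h>

/* Hypothetical helper: theoretical peak in billions of operations per second,
   assuming one vector instruction per core per clock-cycle.
   lanes        = elements per vector (4 for double4, 8 for float8)
   ops_per_lane = 1 for a plain operation, 2 for a MAD (multiply + add) */
double peak_giga_ops(double clock_ghz, int cores, int lanes, int ops_per_lane)
{
    assert(cores > 0 && lanes > 0 && ops_per_lane > 0);
    return clock_ghz * cores * lanes * ops_per_lane;
}
```

With the numbers from the text, `peak_giga_ops(3.5, 4, 1, 1)` gives 14, `peak_giga_ops(3.5, 4, 4, 1)` gives 56 for double4, `peak_giga_ops(3.5, 4, 8, 1)` gives 112 for float8, and `peak_giga_ops(3.5, 4, 8, 2)` gives 224 with MAD-instructions.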

Say we have this CPU with 4 cores and AVX/SSE, and the code below:

int* a = ...;
int* b = ...;
for (int i = 0; i < M; i++) {
   a[i] = b[i]*2;
}

How do you classify the accelerated version of the above code? A parallel computation or a vector-computation? Is it an operation using an M-wide vector, or is it using M threads? The answer is both – vector-computations are a subset of parallel computations, so vector-computations can be run in parallel threads too. This is interesting, as it means the code can run both on the AVX units and on the various cores.
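The “both” answer can be made visible in plain C by splitting the loop on two levels. This is only an illustration under assumptions: the sequential loops stand in for what threads and vector instructions would do, `CHUNKS` assumes 4 cores, `LANES` assumes 8 ints fit in one 256-bit vector, and the names `scale_chunk` and `scale_all` are made up.

```c
#include <assert.h>

enum { LANES = 8, CHUNKS = 4 };  /* assumed: 8 ints per vector, 4 cores */

/* Doubles b[begin..end) into a. The inner loop over `lane` has no
   dependencies between iterations, so each step of LANES elements
   could map to one vector instruction. */
static void scale_chunk(int *a, const int *b, int begin, int end)
{
    int i = begin;
    for (; i + LANES <= end; i += LANES)
        for (int lane = 0; lane < LANES; lane++)
            a[i + lane] = b[i + lane] * 2;
    for (; i < end; i++)          /* scalar tail when the chunk is not a multiple of LANES */
        a[i] = b[i] * 2;
}

/* Splits m elements over CHUNKS independent ranges. The ranges do not
   overlap, so each call to scale_chunk could run in its own thread. */
void scale_all(int *a, const int *b, int m)
{
    assert(m >= 0);
    int per_chunk = (m + CHUNKS - 1) / CHUNKS;   /* rounded up */
    for (int c = 0; c < CHUNKS; c++) {
        int begin = c * per_chunk;
        int end = (begin + per_chunk < m) ? begin + per_chunk : m;
        if (begin < end)
            scale_chunk(a, b, begin, end);
    }
}
```

The outer level (chunks) is the thread-parallel part and the inner level (lanes) is the vector part. A compiler or an OpenCL runtime for the CPU is free to choose such a split itself.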

If you have written the above code, you’d secretly hope the compiler automatically finds out how to run it on all hyper-threaded cores and all vector-extensions the CPU has. To have code make use of the separate cores, you have various options like normal threads or OpenMP/MPI. To make use of the vectors (which increases speed dramatically), you need to use vector-enhanced programming languages like OpenCL.

To learn more about the difference between vectors and parallel code, read the series on programming theories, read my first article on OpenCL-CPU, look around at this site (over 100 articles and a growing knowledge-section), ask us a direct question, use the comments, or help make this blog tick: request a full training and/or code-review.

Continue reading “The OpenCL power: offloading to the CPU (AVX+SSE)” →

AMD positions FirePro S10000 against both TESLA K10 (and K20)

Posted by Vincent Hindriksen on 15 November 2012

During the “little” HPC-show, SC12, several vendors have launched some very impressive products. Question is who steals the show from whom? Intel got their Phi-processor finally launched, NVIDIA came with the TESLA K20 plus K20X, and AMD introduced the FirePro S10000.

This card is the fastest card out there with 5.91 TFLOPS of processing power – much faster than the TESLA K20X, which only does 3.95 TFLOPS. But comparing a dual-GPU to a single-GPU card is not always fair. The moment you choose to have more than one GPU (several GPUs in one case or a small cluster), the S10000 can be fully compared to the Tesla K20 and K20X.

The S10000 can be seen as a dual-GPU version of the S9000, but it does not fully add up. Most obvious are the big difference in power-usage (325 Watt) and the active cooling. As server-cases are made for 225 Watt cooling-power, this is seen as a potential disadvantage. But AMD has clearly looked around – for GPUs not 1U-cases are used, but 3U-servers using the full width to stack several GPUs.

Continue reading “AMD positions FirePro S10000 against both TESLA K10 (and K20)” →

Intel’s answer to AMD and NVIDIA: the XEON Phi 5110P

Posted by Vincent Hindriksen on 12 November 2012 with 3 Comments

NOTE: there are many contradicting sources out there, so there are mistakes in this article. Please give me feedback via twitter, mail or comments, so all the info can be completed.

Yes, another post in the answer-to series. At SC12 Intel tries to steal away the show from the Tesla K20 and FirePro S10000.

After two years of waiting Intel finally comes with an accelerator-card: the Xeon Phi. Compare it to NVIDIA having skipped the GTX 200 series and now presenting the GTX 500 series. Or maybe even the GTX 600 series – we cannot tell yet.

The Phi is not a compute-card as we know it. As you cannot do a 1-to-1 comparison between AMD GCN architecture and NVIDIA Kepler, neither can be easily compared to the Phi. But this article should give an idea on where it is positioned.

Continue reading “Intel’s answer to AMD and NVIDIA: the XEON Phi 5110P” →

NVIDIA’s answer to FirePro S9000: the TESLA K20

Posted by Vincent Hindriksen on 12 November 2012 with 5 Comments

Two months ago I wrote about the FirePro S9000 – AMD’s answer to the K10 – and was already looking forward to this K20. Where in the gaming world, it hardly matters what card you buy to get good gaming performance, in the compute-world it does. AMD presented 3.230 TFLOPS with 6GB of memory, and now we are going to see what the K20 can do.

The K20 is very different from its predecessor, the K10. The biggest differences are the double-precision capability and this card being more powerful while using a single GPU. To keep power-usage low, it is clocked at only 705MHz compared to 1GHz of the K10. I could not find information on the memory-bandwidth.

ECC is turned on by default, hence you see it presented having 5GB. No information yet if this card also has ECC on the local memories/caches.

Continue reading “NVIDIA’s answer to FirePro S9000: the TESLA K20” →

All OpenCL SDKs now in our Knowledge Base

Posted by Vincent Hindriksen on 31 October 2012 with 3 Comments

For those who haven’t seen the latest addition to our knowledge base: we have added a list of (almost) all available OpenCL-SDKs. You can find it in the menu under “Knowledge Base” -> “SDKs…“.

This list shows how important OpenCL is getting, as developers now can write compute-intensive parallel software on CPUs, GPUs, ARM-based accelerators and even FPGAs. This growth of OpenCL-devices is very exciting and important news, and that’s why it has got its own section on the site.

The current list is (in random order):

AMD GPUs & CPUs
ZiiLabs ARM Tablet
Altera FPGA board – available in Q2/Q3 2013
Adapteva Parallella board – available in Q2/Q3 2013
Intel CPUs
Samsung Exynos 5 board – available in December 2012
IBM POWER-processor

Currently looking into:

Intel Xeon Phi
Nintendo Wii U dev
Sony Playstation 4 Orbis
Vivante
Xilinx
NVidia GPUs
Qualcomm

The SDK of NVIDIA is on the second list, which you may not have expected. We have to wait until they have put out their official statement on what they are going to do with CUDA and OpenCL.

While you are there, also check the other parts of the Knowledge Base:

What is… -> Explanations of terminology. Put your requests in a comment.
Event&Talks -> A list of events which StreamHPC attends, gives talks at and helps organise. Interesting for both managers and engineers.
Self Study – The part of the site most visited after the blog. This is for the engineers who want to start learning programming GPUs.

This section will be updated and extended continuously with information not available anywhere else.

StreamHPC has been in the OpenCL business since 2010 as one of the few. We have been the most visible and known OpenCL-specialist ever since.

Scientific Visualisation of Molecules

Posted by Vincent Hindriksen on 31 October 2012 with 2 Comments

In many hard sciences focus is on formulas and text, whereas images are mainly graphs or simplified representations of researched matters. Beautiful visualisations are mainly artist’s impressions in popular media targeting hobby-scientists. When Cyrille Favreau made the first good-working version of his real-time GPU-accelerated raytracer, he saw potential in exactly this area: beautiful, realistic visualisations to be used in serious science. This resulted in software called IPV.

He chose to focus on rendering molecules of proteins and this article discusses raytracing in molecular sciences, while highlighting the features of the software.

This project has been discussed on GPU Science, but this article looks at the software from a slightly different perspective. If you don’t want to know how the software works and what it can do, scroll down for a download-link.

Continue reading “Scientific Visualisation of Molecules” →

Targetting various architectures in OpenCL and CUDA

Posted by Vincent Hindriksen on 23 October 2012

“Everything that *is* makes up one single world; but not everything is alike in this world” – Plato

The question we aim to answer in this post is: “How do you make software that performs on several platforms?”.

Note: This article is not fully finished – I’ll add more information during the coming months. It’s busy here!

Even in a lot of Java-code you’ll find hard-coded filename-delimiters in the file-names, which then work on one OS only. Portability is a problem that exists in various aspects of programming. Let’s look at some of the main goals software can have, and which portability-problems they have.

Functionality. This is the minimum requirement. Once a function is decided, changing functionality takes a lot of time. Writing code that is very flexible in requirements is hard.
User-interface. This is what one sees and which is not too abstract to talk about. For example, porting software to a touch-device requires a lot of rethinking of interaction-principles.
API and library usage. To lower development-time, existing and known APIs and libraries are used. This can work out three ways: separation of concerns, less development-time and dependency. The first two being good architectural choices, the latter being a potential hazard. Changing the underlying APIs is not easy.
Data-types. Handling video is different from handling video-formats. If the files can be handled in the intermediate form used by the software, then adding new file-types is relatively easy.
OS and platform. Besides many visible specifics, an OS is also a collection of APIs. Not only corporate operating systems tend to think of their own platform only, but also competing standards. It compares a lot to what is described under APIs.
Hardware-performance. Optimizing software for a specific platform makes it harder to port to other platforms. This will be the main point of this article.

OpenCL is known for not being performance-portable, but it is the best we currently have when it comes to writing code with performance as a primary target. The funny thing is that with CUDA 5.0 it has become clearer that NVIDIA has the problem in their GPGPU-language too, whereas it was used before to differentiate CUDA from OpenCL. Also, CUDA 5.0 has many new features only available on the latest Kepler-GPUs.

Continue reading “Targetting various architectures in OpenCL and CUDA” →

OpenCL Videos of AMD’s AFDS 2012

Posted by Vincent Hindriksen on 12 October 2012 with 4 Comments

AFDS was full of talks on OpenCL. You missed them, just like me? Then you will be happy that they put many videos on Youtube!

Enjoy watching! As all videos are around 40 minutes, it is best to take a full day for watching them all. The first part is on OpenCL itself, the second on tools, the third on OpenCL usages and the fourth on other subjects.

Continue reading “OpenCL Videos of AMD’s AFDS 2012” →

OpenCL on Altera FPGAs

Posted by Vincent Hindriksen on 4 October 2012

On 15 November 2011 Altera announced support for OpenCL. The time between the announcement of OpenCL-support and actually seeing working SDKs is always longer than expected, so for FPGAs I did not expect anything before 2013. Good news: the drivers are actually working (if you can trust the demos at presentations).

There have been three presentations lately:

In this article I share with you what you should not have missed on these sheets, and add some personal notes to it.

Is OpenCL the key that finally makes FPGAs not tomorrow’s but today’s technology?

Continue reading “OpenCL on Altera FPGAs” →

4 October talk in Amsterdam on mobile compute

Posted by Vincent Hindriksen on 30 September 2012 with 2 Comments

On Thursday 4 October I will talk at Hackers&Founders Amsterdam on what mobile compute can do. The goal is to initiate new ideas for start-ups, as not many know that their mobile phone and tablet are very powerful and can be used for compute-intensive tasks next year.

The other talk is from Mozilla on Firefox OS (Edit: it was cancelled), which is actually reason enough to visit this Hackers&Founders Meetup. Entrance is free, drinks are not. Alternatively you could go to the Hadoop User Group Meetup at Science Park, Amsterdam.

Continue reading “4 October talk in Amsterdam on mobile compute” →

Avoiding false dependencies in only two steps

Posted by Vincent Hindriksen on 29 September 2012

Let’s approach the concept of programming through looking at the brain, the code and the computer.

The idea of a program lives in the brain of a programmer. The way to get the program to the computer is using a system process called coding. When the program coded on the computer and the program embedded as an idea in the brain are alike, the programmer is happy. When over time the difference between the brain-version and the computer-version grows, then we go for a maintenance phase (although this still goes mostly from brain to computer).

When the coding-language or important coding-paradigms change, something completely different happens. In such case the program in the brain is updated or altered. Humans are not good at that, or at least not many textbooks discuss how to change from one model to another.

In this article I want to discuss one of these new coding-paradigms: dependencies in parallel software.
Continue reading “Avoiding false dependencies in only two steps” →

Do you have GPU-brains? A poster-initiative.

Posted by Vincent Hindriksen on 17 September 2012 with 3 Comments

This is a message to GPU-programmers only.

It is a simple question, and has many answers: what are GPU-brains? How is it possible your brain can code GPUs and only a few friends and colleagues understand what you are doing? Is it thinking in parallel, focusing on one kernel and having the architecture in the back of the head? Is it simple loop-unrolling? Is it a web of thoughts? Is it just cool, as not many people can do it?

Continue reading “Do you have GPU-brains? A poster-initiative.” →

Category: Technical
