When copying data from global to local memory, you often see code like the following (for 1D data):
[raw]
if (get_local_id(0) == 0) {
    // one work-item copies the data for the whole work-group
    for (int i = 0; i < N; i++) {
        data_local[i] = data_global[offset+i];
    }
}
barrier(CLK_LOCAL_MEM_FENCE); // make sure all work-items see the copied data
[/raw]
This can be replaced with an asynchronous copy using the function async_work_group_copy, which results in cleaner and more manageable code. The function behaves like an asynchronous version of the memcpy() you know from C and C++.
| event_t async_work_group_copy ( | __local gentype *dst,        |
|                                 | const __global gentype *src, |
|                                 | size_t num_gentypes,         |
|                                 | event_t event )              |
| event_t async_work_group_copy ( | __global gentype *dst,       |
|                                 | const __local gentype *src,  |
|                                 | size_t num_gentypes,         |
|                                 | event_t event )              |
According to the Khronos registry, async_work_group_copy performs an asynchronous copy between global and local memory (a prefetch from global memory is also available). Note that the size argument is the number of elements, not bytes. This makes it much easier to hide the latency of the data transfer: in the example below, the time spent in do_other_stuff() effectively comes for free, which results in faster code.
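Before moving to the 2D case, here is a minimal 1D sketch of the basic pattern, reusing N, offset, data_global and data_local from the first snippet (do_other_stuff() stands for any work that does not depend on the copied data):
[raw]
// Every work-item in the work-group calls async_work_group_copy with the
// same arguments; the copy is then distributed over the work-group.
event_t event = async_work_group_copy(data_local,           // __local destination
                                      &data_global[offset], // __global source
                                      N,                     // number of elements, not bytes
                                      0);                    // 0 = create a new event
do_other_stuff();              // independent work, overlapped with the copy
wait_group_events(1, &event);  // all work-items wait here until the copy is done
[/raw]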
As I could not find good code snippets online, I decided to clean up and share some of my code. Below is a kernel that uses a patch of size (offset*2+1) and works on 2D data, flattened to a float array. You can use it for standard convolution-like kernels.
The copy is executed at work-group level, so there is no need to write code that makes sure it is only executed by one work-item.
[raw]
// 'offset' is assumed to be a compile-time constant, e.g. defined via "-D offset=..." build options.
kernel void using_local(const global float* dataIn, local float* dataInLocal) {
    event_t event = 0; // 0 lets the first copy create a new event
    const int dataInLocalWidth = offset*2 + get_local_size(0);
    // copy the patch row by row; all copies are merged into one event
    for (int i = 0; i < (offset*2 + get_local_size(1)); i++) {
        event = async_work_group_copy(
            &dataInLocal[i*dataInLocalWidth],
            &dataIn[(get_group_id(1)*get_local_size(1) - offset + i) * get_global_size(0)
                    + (get_group_id(0)*get_local_size(0)) - offset],
            dataInLocalWidth,
            event);
    }
    do_other_stuff();              // code that you can execute for free
    wait_group_events(1, &event);  // waits until the copy has finished
    use_data(dataInLocal);
}
[/raw]
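use_data() is just a placeholder. As a hypothetical sketch of how the local patch could be indexed (it would have to be defined before the kernel), each work-item can read its (2*offset+1) x (2*offset+1) neighbourhood around its own position – a simple box filter in this case, with the write to an output buffer left out:
[raw]
// Hypothetical sketch of use_data(): average the (2*offset+1) x (2*offset+1)
// neighbourhood of this work-item from the local patch.
void use_data(local float* dataInLocal) {
    const int dataInLocalWidth = offset*2 + get_local_size(0);
    const int lx = get_local_id(0) + offset; // this work-item's centre in the patch
    const int ly = get_local_id(1) + offset;
    float sum = 0.0f;
    for (int dy = -offset; dy <= offset; dy++) {
        for (int dx = -offset; dx <= offset; dx++) {
            sum += dataInLocal[(ly + dy)*dataInLocalWidth + (lx + dx)];
        }
    }
    float average = sum / ((2*offset + 1)*(2*offset + 1));
    (void)average; // would normally be written to a global output buffer
}
[/raw]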
On the host (C++), these are the most important parts:
[raw]
cl::Buffer cl_dataIn(*context, CL_MEM_READ_ONLY|CL_MEM_HOST_WRITE_ONLY,
                     sizeof(float) * gsize_x * gsize_y);
cl::LocalSpaceArg cl_dataInLocal = cl::Local(sizeof(float) * (lsize_x + 2*offset)
                                             * (lsize_y + 2*offset));
queue.enqueueWriteBuffer(cl_dataIn, CL_TRUE, 0, sizeof(float) * gsize_x * gsize_y, dataIn);
cl::make_kernel<cl::Buffer, cl::LocalSpaceArg>
    kernel_using_local(cl::Kernel(*program, "using_local", &error));
cl::EnqueueArgs eargs(queue, cl::NullRange, cl::NDRange(gsize_x, gsize_y),
                      cl::NDRange(lsize_x, lsize_y));
kernel_using_local(eargs, cl_dataIn, cl_dataInLocal);
[/raw]
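One thing the snippets above do not show is where offset comes from: the kernel relies on it being known at compile time. My assumption is that it is passed as a build option, for example as below (kernelSource and devices are placeholders for your own setup):
[raw]
// Hypothetical: define 'offset' when compiling the kernel (offset = 2 gives a 5x5 patch).
cl::Program program(*context, kernelSource, false, &error);
program.build(devices, "-D offset=2");
[/raw]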
This should work. Some prefer to declare the local memory inside the kernel, but I prefer to set its size from the host instead of fixing it at (JIT) compile time.
This code might not be optimal if you have special tricks for handling the outer border. If you see any improvements, please share them via the comments.

At Intel they have CPUs (Xeon, Ivy Bridge), GPUs (Iris) and accelerators (Xeon Phi). OpenCL enables each of these processors to be used to the fullest, and Intel now promotes it as such. Watch the video below to see their view on why OpenCL makes a difference for Intel's customers.

On 15 – 17 April 2014, a 3-day workshop around HPC is organised. It is free and focuses on bringing industry and academia together.

OpenCL SPIR (Standard Portable Intermediate Representation) is an intermediate representation for OpenCL code, comparable to LLVM IR and HSAIL. It is part of the search for a representation that lets parallel software run well on all kinds of accelerators. LLVM IR itself is too general for this purpose, so SPIR is defined as a subset of it. I will discuss HSAIL later and where it differs from SPIR – SPIR seemed the better place to start introducing these. In my next article I would like to give you an overview of the whole ecosystem around OpenCL (including SPIR and HSAIL), so you get an understanding of what it all means, where we are going, and why.

