How to install OpenCL on Windows

Getting your Windows machine ready for OpenCL is rather straightforward. In short, you only need the latest drivers for your OpenCL device(s) and you’re ready to go. Of course, you will need to add an OpenCL SDK if you want to develop OpenCL applications, but that’s equally easy.

Before we start, a few notes:

  • The steps described herein have been tested on Windows 8.1 only, but should also apply for Windows 7 and Windows 8.
  • We will not discuss how to write an actual OpenCL program or kernel, but focus on how to get everything installed and ready for OpenCL on a Windows machine. This is because writing efficient OpenCL kernels is almost entirely OS independent.

If you want to know more about OpenCL and you are looking for simple examples to get started, check the Tutorials section on this webpage.

Running an OpenCL application

If you only need to run an OpenCL application without getting into development stuff then most probably everything already works.

If OpenCL applications fail to launch, then you need to have a closer look at the drivers and hardware installed on your machine:

GPU Caps Viewer
  • Check that you have a device that supports OpenCL. All graphics cards and CPUs from 2011 and later support OpenCL. If your computer is from 2010 or before, check this page. You can also find a list with OpenCL conformant products on Khronos webpage.
  • Make sure your OpenCL device driver is up to date, especially if you’re not using the latest and greatest hardware. With certain older devices OpenCL support wasn’t initially included in the drivers.

Here is where you can download drivers manually:

  • Intel has hidden them a bit, but you can find them here with support for OpenCL 2.0.
  • AMD’s GPU-drivers include the OpenCL-drivers for CPUs, APUs and GPUs, version 2.0.
  • NVIDIA’s GPU-drivers mostly mention CUDA, but the drivers for OpenCL 1.1/1.2 are there too.

In addition, it is always a good idea to check for any other special requirements that the OpenCL application may have. Look for device type and OpenCL version in particular. For example, the application may run only on OpenCL CPUs, or conversely, on OpenCL GPUs. Or it may require a certain OpenCL version that your device does not support.

A great tool that will allow you to retrieve the details for the OpenCL devices in your system is GPU Caps Viewer.

Developing OpenCL applications

Now it’s time to put the pedal to the metal and start developing some proper OpenCL applications.

The basic steps would be the following:

  • Make sure you have a machine which supports OpenCL, as described above.
  • Get the OpenCL headers and libraries included in the OpenCL SDK from your favourite vendor.
  • Start writing OpenCL code. That’s the difficult part.
  • Tell the compiler where the OpenCL headers are located.
  • Tell the linker where to find the OpenCL .lib files.
  • Build the fabulous application.
  • Run it, and prepare to be amazed.

Ok, so let’s have a look into each of these.

OpenCL SDKs

For OpenCL headers and libraries the main options you can choose from are:

  • The Intel OpenCL SDK
  • The AMD APP SDK
  • The NVIDIA CUDA Toolkit

As long as you pay attention to the OpenCL version and the OpenCL features supported by your device, you can use the OpenCL headers and libraries from any of these three vendors.

OpenCL headers

Let’s assume that we are developing a 64bit C/C++ application using Visual Studio 2013. To begin with, we need to check how many OpenCL platforms are available in the system:

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_int err;
    cl_uint numPlatforms;

    err = clGetPlatformIDs(0, NULL, &numPlatforms);
    if (CL_SUCCESS == err)
        printf("\nDetected OpenCL platforms: %u", numPlatforms);
    else
        printf("\nError calling clGetPlatformIDs. Error code: %d", err);

    return 0;
}
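
Once this compiles and runs, a natural next step is to print the name and version of each platform via clGetPlatformInfo. A sketch, assuming for simplicity that there are at most 16 platforms (this needs an installed OpenCL driver to report anything):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_uint numPlatforms = 0;

    if (clGetPlatformIDs(0, NULL, &numPlatforms) != CL_SUCCESS ||
        numPlatforms == 0) {
        printf("No OpenCL platforms found.\n");
        return 1;
    }

    cl_platform_id platforms[16];
    if (numPlatforms > 16)
        numPlatforms = 16;
    clGetPlatformIDs(numPlatforms, platforms, NULL);

    for (cl_uint i = 0; i < numPlatforms; ++i) {
        char name[256] = "";
        char version[256] = "";
        /* Query human-readable platform details. */
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VERSION,
                          sizeof(version), version, NULL);
        printf("Platform %u: %s (%s)\n", i, name, version);
    }
    return 0;
}
```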

We need to specify where the OpenCL headers are located by adding the path to the OpenCL “CL” folder to the Additional Include Directories. If you use NVIDIA’s SDK, the “CL” folder is in the same location as the other CUDA include files, that is, CUDA_INC_PATH. On a x64 Windows 8.1 machine with CUDA 6.5 the environment variable CUDA_INC_PATH is defined as “C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include”.

If you’re using the AMD SDK, you need to replace “$(CUDA_INC_PATH)” with “$(AMDAPPSDKROOT)/include” or, for Intel SDK, with “$(INTELOCLSDKROOT)/include“.

OpenCLNVIDIA_AdditionalInclude

OpenCL libraries

Similarly, we need to let the linker know about the OpenCL libraries. Firstly, add OpenCL.lib to the list of Additional Dependencies:

OpenCLNVIDIA_AdditionalDependencies

Secondly, specify the OpenCL.lib location in Additional Library Directories:

OpenCLNVIDIA_AdditionalLibrary

As in the case of the includes, if you’re using the AMD SDK, replace “$(CUDA_LIB_PATH)” with “$(AMDAPPSDKROOT)/lib/x86_64”, or in the case of Intel with “$(INTELOCLSDKROOT)/lib/x64“.

And you’re good to go! The application should now build and run. Now, just how difficult was it? Happy OpenCL-coding on Windows!

If you have any question or suggestion, just leave a comment.

The Exascale rat-race vs getting-things-done with HPC

slide-12-638
IDC Forecasts 7 Percent Annual Growth for Global HPC Market – HPCwire

When the new supercomputer “Cartesius” of the Netherlands was presented to the public a few months ago, the buzz was not around FLOPS, but around users. SARA CEO Dr. Ir. Anwar Osseyran kept focusing on this aspect. The design of the machine was not pushed by getting into the TOP500, but by improving the performance of the current users’ software. This was applauded by various HPC experts, including StreamHPC. We need to get things done, not win a virtual race over some number.

In the description about the supercomputer, the top500-position was only mentioned at the bottom of the page:

Cartesius entered the Top500 in November 2013 at position 184. This Top500 entry only involves the thin nodes resulting in a Linpack performance (Rmax) of 222.7 Tflop/s. Please note that Cartesius is designed to be a well balanced system instead of being a Top500 killer. This is to ensure maximum usability for the Dutch scientific community.

What would happen if you went for a TOP500 supercomputer? You might get a high energy bill and an overpriced, inefficient supercomputer. In the first months you will not have full usage of the machine, and you won’t be able to easily turn off some parts, hence electricity is wasted. The end result is that it becomes more attractive to run unoptimized code on the cluster than to take the time to optimise it.

The inefficiency is due to the fact that some software is data-transfer limited and other software is compute-limited. No need to explain that if you go for a Top 500 position and not for a software-optimized design, you end up buying extra hardware to get all kinds of algorithms performing. Cartesius therefore has “fat nodes” and “light nodes” to get the best bang per buck.

There is also a plan for expanding the machine over the years (on-demand growth), such that the users will remain happy instead of having an adrenaline-shot at once.

The rat-race

The HPC Top 500 is run by the company behind the ISC events. They care about their list being used, not whether Exascale arrives now or later. There are two companies with a particular interest in Exascale: Intel and IBM. It hardly matters anymore how it began. What is interesting is that Intel has bought InfiniBand technology and is collecting companies that could make them the one-stop shop for an HPC cluster. IBM has always been strong in supercomputers with their BlueGene HPC line. Intel has a very nice infographic on Intel+Exascale, which shows how serious they are.

But then the big question comes: did all this pushing speed up the road to Exascale? Well, no… just the normal peaks and lows around the theoretical logarithmic line:

Top500-exponential-growth
source: CNET

What I find interesting in this graph is that the #500 line is diverging from the #1 line. With GPGPU it was quite easy to enter the Top 500 three years ago.

Did the profits rise? Yes. While PC-sales went down, HPC-revenues grew:

Revenues in the high-performance computing (HPC) server space jumped 7.7 percent last year to $11.1 billion surpassing the $10.3 billion in revenues generated in 2011, according to numbers released by IDC March 21. This came despite a 6.8 percent drop in shipments, an indication of the growing average selling prices in the space, the analysts said. (eWeek.)

So, mainly the willingness to buy HPC has increased. And you cannot stay behind when the rest of the world is focusing on Exascale, can you?

Read more

Keep your feet on the ground and focus on what matters: papers and answers to hard questions.

Did you solve a compute problem and get published with a sub-top-250 supercomputer? Share it in the comments!

OpenCL on Altera FPGAs

On 15 November 2011 Altera announced support for OpenCL. The time between announcing OpenCL support and delivering an actually working SDK always takes longer than expected, so I did not expect anything working on FPGAs before 2013. Good news: the drivers are actually working (if you can trust the demos at presentations).

There have been three presentations lately:

In this article I share with you what you should not have missed on these sheets, and add some personal notes to it.

Is OpenCL the key that finally makes FPGAs not tomorrow’s but today’s technology?

Continue reading “OpenCL on Altera FPGAs”

We ported GROMACS from CUDA to OpenCL

GROMACS does soft matter simulations on molecular scale. Let it fly.

GROMACS is an important molecular simulation kit, which can do all kinds of “soft matter” simulations like nanotubes, polymer chemistry, zeolites, adsorption studies, proteins, etc. It is used by researchers worldwide and is one of the bigger bioinformatics packages around.

To speed up the computations, GPUs can be used. The big problem is that only NVIDIA GPUs could be used, as the code was written in CUDA. To make it possible to use other accelerators, we ported it to OpenCL. It took a small team several months to get to the alpha release, and now I’m happy to present it to you.

Those who know us from consultancy (and training) only might have noticed: this is our first product!

We promised to keep it under the same open source license and that effectively means we are giving it away for free. Below I’ll explain how to obtain the sources and how to build it, but first I’d like to explain why we did it pro bono.

Why we did it

Indeed, we did not get any money (income or funds) for this. There have been several reasons, of which the below four are the most important.

  • The first reason is that we want to show what we can do. Each project was under NDA, so we could not demo anything we made for a customer. We chose a CUDA package to port to OpenCL, as we notice a trend of porting CUDA software to OpenCL (e.g. Adobe software).
  • The second reason is that bio-informatics is an interesting industry, where we would like to do more work.
  • The third reason is that we can find new employees. Joining the project is a way to get noticed and could lead to a job offer. The GROMACS project is big and needs unique background knowledge, so it can easily overwhelm people. This makes it perfect software to test out who is smart enough to handle such complexity.
  • Fourth is gaining experience with handling open source projects and distributed teams.

Therefore I think it’s a very good investment, while giving something (back) to the community.

Presentation of lessons learned during SC14

We just jumped in and went for it. We learned a lot, because it did not go as we expected. We would like to share all this experience at SuperComputing 2014.

During SC14 I will give a presentation on the OpenCL port of GROMACS and the lessons learned. As AMD was quite happy with this port, they provided me a place to talk about the project:

“Porting GROMACS to OpenCL. Lessons learned”
SC14, New Orleans, AMD’s mini-theatre.
19 November, 15:00 (3:00 pm), 25 minutes

The SC14 demo will be available at the AMD booth the whole week, so if you’re curious, you can see it live with an explanation.

If you’d like to talk in person, please send an email to make an appointment for SC14.

Getting the sources and build

It still has rough edges, so a better description would be “we are currently porting GROMACS to OpenCL”, but we’re very close.

As it is work in progress, no binaries are available. So besides knowledge of C, C++ and CMake, you also need to know how to work with Git. It builds on both Windows and Linux, and NVIDIA and AMD GPUs are the target platforms for the current phase.

The project is waiting for you on https://github.com/StreamHPC/gromacs.

The wiki has lots of information, from how to build and supported devices to the project planning. Please RTFM before starting! If something is missing on the wiki, please let us know by simply reporting a new issue.

Help us with the GROMACS OpenCL port

We would like to invite you to join, so we can make the port better than the original. There are several reasons to join:

  1. Improve your OpenCL skills. What really applies to the project is this quote:

    Tell me and I forget.
    Teach me and I remember.
    Involve me and I learn.

  2. Make the OpenCL ecosystem better. Every product with OpenCL support gives the user the choice of which GPU to use (NVIDIA, AMD or Intel).
  3. Make GROMACS better. It already has a large community, and OpenCL knowledge is needed now.
  4. Get hired by StreamHPC. You’ll be working with us directly, so you’ll get to know our team.

What can you do? There is much you can do. Once you have managed to build and run it, look at the bug reports. The first focus is to get the failing kernels working – this is the top priority to finalise phase 1. After that, the real fun begins in phase 2: adding features and optimising for speed on specific devices. Since AMD FirePro is much better at double precision than NVIDIA Tesla, it would be interesting to add support for double precision. Also, certain parts of the code are still done on the CPU and have real potential to be ported to the GPU.

If things are not clear and obstruct you from starting, don’t get stressed; send an email with any question you have. We’re awaiting your merge request or issue report!

Special thanks

This project wasn’t possible without the help of many people. I’d like to thank them now.

  • The GROMACS team in Sweden, from the KTH Royal Institute of Technology.
    • Szilárd Páll. A highly skilled GPU engineer and PhD student, who pro-actively keeps helping us.
    • Mark Abraham. The GROMACS development manager, always quickly answering our various questions and helping us where he could.
    • Berk Hess. Who helped answering the harder questions and feeding the discussions.
  • Anca Hamuraru, the team lead. Works at StreamHPC since June, and helped structure the project with much enthusiasm.
  • Dimitrios Karkoulis. Has been volunteering on the project since the start in his free time. So special thanks to Dimitrios!
  • Teemu Virolainen. Works at StreamHPC since October and has shown to be an expert on low-level optimisations.
  • Our contacts at AMD, for helping us tackle several obstacles. Special thanks go to Benjamin Coquelle, who checked out the project to reproduce problems.
  • Michael Papili, for helping us with designing a demo for SC14.
  • Octavian Fulger from Romanian gaming-site wasd.ro, for providing us with hardware for evaluation.

Without these people, the OpenCL port would never have gotten here. Thank you.

Intel CPUs, GPUs & Xeon Phi

[infobox type=”information”]

Need a XeonPhi or Intel OpenCL programmer? Hire us!

[/infobox]

Intel has OpenCL support for all their recent CPUs which have SSE 4.x and AVX. Since Sandy Bridge the CPUs tend to have good performance. On Ivy Bridge and later there is also support for the embedded GPU (Windows-only). Xeon Phi has support for OpenCL, even though Intel mostly promotes OpenMP for it.

SDK

Intel does not provide a standard SDK kit which contains both hardware and software, as their hardware is broadly available.

The driver can be downloaded from the Intel OpenCL page – select your OS at the upper-right and click ‘Download’.

The samples are included with the driver, if you use Windows. They can be downloaded separately here. If you have Linux, you can download the samples which have been ported to GCC from our blog – here you can also read on how to install the SDK.

Tools

There are various developer tools available. You can find them here:

  • Offline compiler (stand-alone (Windows+Linux) and VisualStudio-plugin)
  • OpenCL – Debugger (VisualStudio only)
  • Integration with Graphics Performance Analyzers (Windows download)
  • VTune Amplifier XE for code optimisation (more info here; starting at $899 for both Windows and Linux)

Supported hardware

In short: all Ivy Bridge and Sandy Bridge processors.

intel-opencl

Currently the HD4000 is the only embedded GPU that can do OpenCL, and it is only supported via Windows drivers.

Xeon Phi

Intel’s official page has more info on the processor-card, and here you’ll find the most recent (public) info.

Xeon Phi
Non-production version of Xeon Phi with half the memory-banks visible around the (large) die.

CPUs and GPUs

With Xeons of 12 to 16 cores and AVX2 (256-bit wide vectors), OpenCL works very well on CPUs.

For GPU bug-reports go to this forum.

Learning material

See this blog post for information on where to find all drivers and samples.

To optimise OpenCL for Intel-processors, you can go through their very nice Optimization Guide. There is also a nice overview of tips&tricks in this article. The Intel OpenCL forums are also a very good source of information.

Phoronix OpenCL Benchmark 3.0 beta

So you want OpenCL benchmarks? The Phoronix Test Suite is a benchmark suite for OSX and Linux, created by Michael Larabel and Matthew Tippett (http://en.wikipedia.org/wiki/Phoronix_Test_Suite). On Ubuntu, version 2.8 is in the Ubuntu “app store” (Synaptic), but 3.0 has those nice OpenCL tests. The tests are based on David Bucciarelli‘s OpenCL demos. Getting started with Phoronix 3.0 (beta 1) is done in 4 easy steps:

  1. Download the latest beta-version from http://www.phoronix-test-suite.com/?k=downloads
  2. Extract. Can be anywhere. I chose /opt/phoronix-test-suite
  3. Install. Just type ./phoronix-test-suite in a terminal
  4. Use.

WARNING: It is beta-software and the following might not work on your machine! If you have problems with this tutorial and want or found a fix, post a reply.

Continue reading “Phoronix OpenCL Benchmark 3.0 beta”

Support matrix of Compute SDKs

Multi-Core Processors and the SDKs

The empty boxes show that IBM and ARM have a lot of influence. With NVIDIA’s current pace of introducing new products (hardware and CUDA), they could also take on ARM.

The matrix is restricted to the current better-known compute technologies: OpenCL, CUDA, Intel ArBB, Pathscale ENZO, MS DirectCompute and AccelerEyes JacketLib.

X = All OSes, including MAC
D = Developer (private alpha or private beta)
P = Planned (as i.e. stated in Intel’s Q&A)
U = Unofficial (IBM’s OpenCL-SDK is promoted for their POWER-line)
L = Linux-only
W = Windows-only
? = Unknown if planned

Continue reading “Support matrix of Compute SDKs”

PDFs of Monday 5 September

Live from le Centre Pompidou in Paris: Monday PDF-day. I have never been inside the building, but it is a large public library where people are queueing to get in – no end to the knowledge-economy in Paris. A great place to read some interesting articles on the subjects I like.

CUDA-accelerated genetic feedforward-ANN training for data mining (Catalin Patulea, Robert Peace and James Green). Since I have some background on Neural Networks, I really liked this article.

Self-proclaimed State-of-the-art in Heterogeneous Computing (Andre R. Brodtkorb a , Christopher Dyken, Trond R. Hagen, Jon M. Hjelmervik, and Olaf O. Storaasli). It is from 2010, but just got thrown on the net. I think it is a must-read on Cell, GPU and FPGA architectures, even though (as also remarked by others) Cell is not so state-of-the-art any more.

OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems (John E. Stone, David Gohara, and Guochun Shi). A basic and clear introduction to my favourite parallel programming language.

Research proposal: Heterogeneity and Reconfigurability as Key Enablers for Energy Efficient Computing. About increasing energy efficiency with GPUs and FPGAs.

Design and Performance of the OP2 Library for Unstructured Mesh Applications. CoreGRID presentation/workshop on OP2, an open-source parallel library for unstructured grid computations.

Design Exploration of Quadrature Methods in Option Pricing (Anson H. T. Tse, David Thomas, and Wayne Luk). Accelerating specific option pricing with CUDA. Conclusion: FPGA has the least Watt per FLOPS, CUDA is the fastest, and CPU is the big loser in this comparison. Must be mentioned that GPUs are easier to program than FPGAs.

Technologies for the future HPC systems. Presentation on how HPC company Bull sees the (near) future.

Accelerating Protein Sequence Search in a Heterogeneous Computing System (Shucai Xiao, Heshan Lin, and Wu-chun Feng). Accelerating the Basic Local Alignment Search Tool (BLAST) on GPUs.

PTask: Operating System Abstractions To Manage GPUs as Compute Devices (Christopher J. Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray, and Emmett Witchel). MS research on how to abstract GPUs as compute devices. Implemented on Windows 7 and Linux, but code is not available.

PhD thesis by Celina Berg: Building a Foundation for the Future of Software Practices within the Multi-Core Domain. It is about a Rupture-model described at Ch.2.2.2 (PDF-page 59). [total 205 pages].

Workload Balancing on Heterogeneous Systems: A Case Study of Sparse Grid Interpolation (Alin Murarasu, Josef Weidendorfer, and Arndt Bodes). In my opinion a very important subject, as this can help automate much-needed “hardware-fitting”.

Fraunhofer: Efficient AMG on Heterogeneous Systems (Jiri Kraus and Malte Förster). AMG stands for Algebraic MultiGrid method. Paper includes OpenCL and CUDA benchmarks for NVidia hardware.

Enabling Traceability in MDE to Improve Performance of GPU Applications (Antonio Wendell de O. Rodrigues, Vincent Aranega, Anne Etien, Frédéric Guyomarc’h, Jean-Luc Dekeyser). Ongoing work on OpenCL code generation from UML (Model Driven Design). [34 pag PDF]

GPU-Accelerated DNA Distance Matrix Computation (Zhi Ying, Xinhua Lin, Simon Chong-Wee See and Minglu Li). DNA sequences distance computation: bit.ly/n8dMis [PDF] #OpenCL #GPGPU #Biology

And while browsing around for PDFs I found the following interesting links:

  • Say bye to Von Neumann. Or how IBM’s Cognitive Computer Works.
  • Workshop on HPC and Free Software. 5-7 October 2011, Ourense, Spain. Info via j.anhel@uvigo.es
  • Basic CUDA course, 10 October, Delft, Netherlands, €200,-.
  • Par4All: automatic parallelizing and optimizing compiler for C and Fortran sequential programs.
  • LAMA: Library for Accelerated Math Applications for C/C++.

USB-stick sized ARM-computers

Now that smartphones get more powerful and the internet makes it possible to have all functionality and documents with you anywhere, the computer needs to be reinvented. You see all big IT companies searching for what that could look like, from Windows Metro to complete docking stations that replace the desktop with your phone. A turbulent market.

One of the new product categories is USB-stick sized computers. Stick one into a TV or monitor, zap in your code and you have your personal working environment. You never need to carry a laptop to your hotel room or conference, as long as a screen is available – any screen.

There are several USB-computers entering the market, but I wanted to introduce you to two. Both see a future in a strong processor in a portable device, and neither has a real product with such strong processors yet. But you can expect that in 2013 you can have a device on your key-ring that can do very fast parallel processing, for a smooth Photoshop experience.

Continue reading “USB-stick sized ARM-computers”

Is OpenCL coming to Apple iOS?

Answer: No, or not yet. Apple tested Intel and AMD hardware for OSX, and not portable devices. Sorry for the false rumour; I’ll keep you posted.

Update: It seems that OpenCL is on iOS, but only available to system-libraries and not for apps (directly). That explains part of the responsiveness of the system.

On the thirteenth of August 2011 Apple asked the Khronos Group to test 7 unknown devices for conformance with OpenCL 1.1. As Apple uses OpenCL-conformant hardware by AMD, NVIDIA and Intel in their desktops, the first conclusion is that they have been testing their iOS devices. A quick look at the list of iOS 5 capable devices gives the following potential candidates:

  • iPhone 3GS
  • iPhone 4
  • iPhone 5
  • iPad
  • iPad 2
  • iPod Touch 4th generation
  • Apple TV

If OpenCL comes to iOS soon (as it has already been tested), iOS 5 would be the moment. The processors of iOS 5 devices are all capable of getting a speed-up from OpenCL, so it is not a gimmick feature. This could speed up many features, among them media conversion, security enhancements and manipulation of data streams. Where now the cloud or the desktop has to be used, in the future it can be done on the device.

Continue reading “Is OpenCL coming to Apple iOS?”

Products using OpenCL on ARM MALI are coming

The past year you might not have heard much about OpenCL-on-ARM, besides the Arndale developer board. You have heard just a small portion of what has been going on.

Yesterday the (Linux) OpenCL drivers for the Chromebook (which contains an ARM MALI T604) have been released, and several companies will launch products using OpenCL.

Below are a few interviews with companies who have built such products. This will give an idea of what is possible on those low-power devices. To first get an idea of what this MALI T604 GPU can do when it comes to OpenCL, here is a video from the 2013 edition of the LEAP conference we co-organised.

http://www.youtube.com/watch?v=UQXfjvcqiQg

Understand that the whole board takes less than ~11.6 Watts – that includes the CPU, GPU, memory, interconnects, networking, SD-card, power-adapter, etc. Only a small portion of that is the GPU. I don’t know the exact specs, as this developer board was not targeted towards energy-optimisation goals. I do know this is less than the 225 Watts of a discrete GPU alone.

Interviews with ARM partners

Continue reading “Products using OpenCL on ARM MALI are coming”

Why did AMD open source ROCm’s OpenCL driver-stack?

AMD open sourced the OpenCL driver stack for ROCm in the beginning of May. With this they kept their promise to open source (almost) everything. The hcc compiler was open sourced earlier, just like the kernel-driver and several other parts.

Why is this a big thing?
There are indeed several open source OpenCL implementations, but with one big difference: they’re secondary to the official compiler/driver. Implementations like PoCL and Intel Beignet play catch-up. AMD’s open source implementation is primary.

It contains:

  • OpenCL 1.2 compatible language runtime and compiler
  • OpenCL 2.0 compatible kernel language support with OpenCL 1.2 compatible runtime
  • Support for offline compilation right now – in-process/in-memory JIT compilation is to be added.

For testing the implementation, see Khronos OpenCL CTS framework or Phoronix openbenchmarking.org.

Why is it open sourced?

There are several reasons. AMD wants to stand out in HPC and therefore listened carefully to their customers, while taking good note of where HPC was going. Where open source used to be something not for businesses, it is now simply required to be commercially successful. Below are the most important answers to this question.

Give deeper understanding of how functions are implemented

It is very useful to understand how functions are implemented. For instance, the difference between sin() and native_sin() can tell you a lot more about what’s best to use. The source does not show how the functions are implemented on the GPU, but it does show which GPU functions are called.

Learning a new platform has never been so easy. Deep understanding is needed if you want to go beyond “it works”.
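
As an illustration, here is a minimal OpenCL C kernel (my own sketch, not from the driver sources) that computes both variants side by side; with the driver stack open, you can now trace which GPU instructions each call lowers to:

```c
/* OpenCL C kernel: native_sin maps to a faster, less precise hardware
   approximation, while sin must meet the precision the spec requires. */
__kernel void sines(__global const float *in,
                    __global float *precise,
                    __global float *fast)
{
    size_t i = get_global_id(0);
    precise[i] = sin(in[i]);        /* full-precision version          */
    fast[i]    = native_sin(in[i]); /* hardware-native approximation   */
}
```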

Debug software

When you are working on a large project and have to work with proprietary libraries, this is a typical delay factor. I think every software engineer has had the experience that a library does not perform as documented and work-arounds had to be created. Depending on the project and the library, it could take weeks of delay – only sarcasm can describe these situations, as the legal documents were often a lot better than the software documents. When the library was open source, the debugger could step in and give the “aha” that was needed to progress.

When working with drivers it’s about the same. GPU drivers and compilers are extremely complex, and of course your project hits that bug which nobody encountered before. Now that all is open source, you can step into the driver with the debugger. Moreover, the driver can be compiled with a fix instead of a work-around.

Get bugs solved quicker

A trace can now include the driver stack and the line numbers. Even a suggestion for a fix can be given. This not only improves reproducibility, but reduces the time needed to get a fix in. When a fix is suggested, AMD only needs to test for regressions to accept it. This makes the work for tools like CLsmith a lot easier.

Have “unimportant” specific improvements done

Say your software is important and in the spotlight, like Blender or the LuxMark benchmark; then you can expect your software gets attention in optimisations. The rest of us have to hope our special code constructions resemble one that is targeted. This results in many forum comments and bug reports being written, for which the compiler team does not have enough time. This is frustrating for both sides.

Now everybody can have their improvements submitted, given it does not slow down the focus software, of course.

Get the feature set extended

Adding SPIR-V is easy now. The SPIR-V frontend needs to be added to ROCm and the right functions need to be added to the OpenCL driver. Unfortunately there is no support for OpenCL 2.x host code yet – as I understood, for lack of demand.

For such extensions the AMD team needs to be consulted first, because this has implications on the test-suite.

Get support for complete new things

It takes a single person to make something completely new – and this becomes a whole lot easier now.

More often there is opportunity in what is not there yet, and research needs to be done to break the chicken-and-egg problem. Optimised 128-bit computing? Easy complex numbers in OpenCL? Native support for Halide as an alternative to OpenCL? All the high performance code is there for you.

Initiate alternative implementations (?)

Not a goal, but forks are coming for sure. For most forks the goals would be like the ones above, to be merged later with the master branch. A few forks will go their own direction – for now it is hard to predict where those will go.

Improve and increase university collaborations

When the software was closed, working on AMD’s compiler infrastructure was only possible under strict contracts. In the end it was easier to focus on the open source backends of LLVM than to go through the legal path.

Universities are very important for finding unexpected opportunities, integrating the latest research, bringing in potential new employees and doing research collaborations. An added bonus for the students is that the GPUs might be allowed to be used for games too.

Timour Paltashev (Senior manager, Radeon Technology Group, GPU architecture and global academic connections) can be reached via timour dot paltashev at amd dot com for more info.

Get better support in more Linux distributions

It’s easier to include open source drivers in Linux distributions. These OpenCL drivers do need a binary firmware blob (which has been disassembled and seems to do what it advertises), and the discussion is whether that blob counts as hardware or as software when marking the stack as “libre”.

There are many obstacles before the complete ROCm stack is included as the default, but in its current state it stands a much better chance.

Performance

Phoronix benchmarked ROCm 1.4 OpenCL in January on several systems, and more recently ROCm 1.5 OpenCL on a Radeon RX 470. Though the 1.5 benchmarks were more limited, the important conclusion is that the young compiler is now mostly on par with the closed source OpenCL implementation combined with the AMDGPU drivers – only in LuxMark were the AMDGPU drivers (much) better. The same holds when comparing against the old proprietary fglrx drivers, which were fully optimised and the first target to catch up with. Expect another big step forward with ROCm 1.6 OpenCL.

Get started

You can find the build instructions here. Let us know in the comments what you’re going to do with it!

ARM

ARM is best known for its CPU architectures, but it also has a GPU architecture: MALI. Their devices support:

  • OpenCL
  • OpenGL
  • Vulkan
  • RenderScript

OpenCL

ARM takes OpenCL seriously and has various developer boards and drivers for their MALI GPU. Most notably, Samsung uses MALI in their Exynos chips, and now Rockchip also brings high-end MALI GPUs.

Drivers and SDK

ARM MALI Linux SDK

The SDK can be downloaded here. The developer manual is here. A 14-page FAQ with answers to many of your questions is here.

For compilation on Ubuntu, the package g++-arm-linux-gnueabi is needed. Also remove the “-none” in platform.mk. Compilation will then result in a libOpenCL.so.
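The steps above can be sketched as a short shell session. The SDK directory name is an assumption; the platform.mk edit is also demonstrated on a stand-in file so you can see exactly what the sed command does:

```shell
# Cross-compiling the MALI OpenCL SDK samples on Ubuntu (sketch).
# Package name taken from the text; the SDK directory name is an assumption:
#   sudo apt-get install g++-arm-linux-gnueabi
#   cd Mali_OpenCL_SDK_vX.X.X
#   sed -i 's/-none//g' platform.mk    # drop the "-none" target suffix
#   make                               # results in libOpenCL.so
#
# The platform.mk edit, shown on a stand-in file:
printf 'TARGET=arm-linux-gnueabi-none\n' > platform.mk
sed -i 's/-none//g' platform.mk
cat platform.mk    # TARGET=arm-linux-gnueabi
```

After make finishes, link your host programs against the resulting libOpenCL.so rather than a system-wide OpenCL loader.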


Drivers for Android

Software available for the Arndale can be found here – drivers (including graphics drivers with OpenCL support) are here. The current state is that the OpenCL drivers sometimes work, but most of the time they do not – we are very sorry for that and are trying to find fixes.

We have not tested the drivers with devices other than the Arndale, even those with the same chipset (such as the new Chromebook and the Nexus 10).

Drivers for Linux

For the Samsung Chromebook, drivers are available here. These drivers should work for the Arndale too (not tested yet), provided you use kernel drivers of the same version.

Firefly RK3288


The Firefly has:

  • RK3288 Cortex-A17 quad-core @ 1.8GHz
  • Mali-T764 GPU with support for OpenGL ES 1.1/2.0/3.0, OpenVG 1.1, OpenCL and DirectX 11

The drivers have not yet been tested with this board! Cheap alternatives can be found here.

Exynos 5 Boards

The following boards are available:

  1. Arndale 5250-A
  2. YICsystem YSE5250
  3. YICsystem SMDKC520
  4. Nexus 10
  5. Samsung Chromebook

Scroll down for more info on these boards.

Board 1: Arndale 5250-A

The board is fully loaded and can be extended with a touchscreen, SSD, Wifi+sound and a camera. Below is an image with the sound board and connectivity board attached.

Working boards using OpenCL on display (click on image to see the Twitter-status):

Here are a few characteristics:

  • Cortex-A15 @ 1.7 GHz dual core
  • 128 bit SIMD NEON
  • 2GB 800MHz DDR3 (32 bit)

More information can be found on the wiki and forum.

Order information

For more information and to order, go to http://howchip.com/shop/item.php?it_id=AND5250A. For an overview of extensions, go to: http://howchip.com/shop/content.php?co_id=ArndaleBoard_en. The price is $250 for the board and $50 for shipping to Europe; extension boards start at $60. You need a VAT number to get it through customs, but you have to pay EU VAT anyway.

Currently you need to order the LCD too, as the latest proprietary drivers (which include OpenCL) do not work with HDMI. There are (vague) solutions to be found on the forums.

Be sure to buy a good 5V adapter (with the + on the pin). A minimum of 3A is required for the board (the TDP of the whole board is 11 to 12 Watt). An adapter costs around $25 in a store, or you can buy one online for $7.50. You also need a serial cable – there are USB-to-COM cables under €20. If you are in doubt, buy the $60 package with cables (no USB-to-COM cable), adapter and microSD card.

Board 2: YICsystem YSE5250

This board has 2GB DDR3L (32-bit, 800MHz), 8GB eMMC (on-board memory card), USB 3.0 and LAN. Optional boards exist for audio, WIFI, sensors (gyroscope, accelerometer, magnetic, light & proximity), a 5MP camera, LCD and GPS.

It is currently unknown whether OpenCL drivers will be delivered; there is no mention of them on their site.


Below you’ll find the layout of the board.

yes5250_layout

Order information

You can order at http://www.yicsystem.com/products/low-cost-board/yse5250/. The complete board costs $245. Costs for shipping and import are unknown.

Board 3: YICsystem SMDKC520

The SMDKC520 board is the official reference board for the Samsung Exynos 5250 system. It is currently unknown whether OpenCL drivers will be delivered, but as chances are high I have already included it here.

It is like the YSE5250, but it seems to include WIFI, a camera and an LCD – though the webpage is very vague. Once I have more info on the YSE5250, I’ll continue gathering more info on this board.

The price is unknown, but it does not fall under “budget boards”.


Order information

You can send an enquiry via http://www.yicsystem.com/products/smdk-board/smdk-c520/. Remember that OpenCL support is currently unknown!

Board 4: Google Nexus 10

OpenCL drivers have been found pre-installed on this tablet, so with some tinkering you can run OpenCL right away.

It is a complete tablet, so no case-modding is needed. It has 2GB RAM, WIFI, 16 or 32GB eMMC, 5MP camera, 10″ WXGA LCD, all sensors, NFC, sound, etc. For all the specs, see this page.


Order information

The Nexus 10 cannot be ordered in all countries, as Google has restricted the sales channels. See http://www.google.com/nexus/10/ for more info on ordering. With some creativity you can find ways to get this tablet into countries not selected by Google. The price is $400 or €400.

Board 5: Samsung Chromebook

chromebook

For €300 you get a complete laptop that runs Linux and has OpenGL ES and OpenCL 1.1 drivers. That makes it a great OpenCL “board”.

See ARM’s Chromebook dev-page for more information on how to get Linux running with OpenCL and OpenGL.

The drivers are brand new – once we have tested them, we’ll add more information to this page.

The 8 reasons why our customers had their code written or accelerated by us

Making software better and faster.

In the past six years we have helped various customers solve their software performance problems. While each project has been very different, there have been 8 reasons to hire us as performance engineers. These can be categorised in three groups:

  • Reduce processing time
    • Meeting timing requirements
    • Increasing user efficiency
    • Increasing the responsiveness
    • Reducing latency
  • Do more in the same time
    • Increasing simulation/data sizes
    • Adding extra functionality
  • Reduce operational costs
    • Reducing the server count
    • Reducing power usage

Let’s go into each of these. Continue reading “The 8 reasons why our customers had their code written or accelerated by us”

Market Positioning of Graphics and Compute solutions

When compute became possible on GPUs, it was first presented as an extra feature and did not change the positioning of the products by AMD/ATI and Nvidia much. NVidia started positioning server compute (described as “the GPU without a monitor connector”), and AMD and Intel followed. When the expensive GeForce GTX Titan and Titan Z were introduced it became clear that NVidia still thinks about positioning: Titan is the bridge between GeForce and Tesla, a Tesla with video-out.

Why is positioning important? It is the difference between “I’d like to buy a compute card for my desktop, so I can develop algorithms that will also run on the compute server” and “I’d like to buy a graphics card for doing computations and later run that on a passively cooled graphics card”. The second version might get a “you don’t want to do that”, as graphics terminology is being used to describe compute goals.

Let’s get to the overview.

Segment                AMD            NVIDIA                               Intel                      ARM
Desktop User *         A-series APU   –                                    Iris / Iris Pro            –
Laptop User *          A-series APU   –                                    Iris / Iris Pro            –
Mobile User            –              Tegra                                Iris                       Mali T720 / T4xx
Desktop Gamer          Radeon         GeForce                              –                          –
Laptop Gamer           Radeon M       GeForce M                            –                          –
Mobile High-end        –              Tegra K (?)                          Iris Pro                   Mali T760 / T6xx
Desktop Graphics       FirePro W      Quadro                               –                          –
Laptop Graphics        FirePro M      Quadro M                             –                          –
Desktop (DP) Compute   FirePro W      Titan (HDMI) / Tesla (no video-out)  XeonPhi                    –
Laptop (DP) Compute    FirePro M      Quadro M                             XeonPhi                    –
Server (DP) Compute    FirePro S      Tesla                                XeonPhi (active cooling!)  –
Cloud                  Sky            Grid                                 –                          –

* = For people who say “I think my computer doesn’t have a GPU”.

My thoughts are that Titan is meant to promote compute at the desktop, while Tesla is also promoted for that. AMD has the FirePro W for that, serving both graphics professionals and compute professionals. Intel uses XeonPhi for anything compute, and it is all actively cooled.

The table has some empty spots: Nvidia doesn’t have an IGP, AMD doesn’t have mobile graphics and Intel doesn’t have a clear message at all (J, N, X, P and K are mixed across all types of markets). Mobile GPU vendors such as ARM, Imagination and Qualcomm have a clear message differentiating high-end from low-end mobile GPUs, whereas NVidia and Intel don’t.

Positioning of the Titan Z

Even though I think Nvidia made the right move by positioning a GPU for the serious compute hobbyist, their proposition is very unclear. AMD is very clear: “Want professional graphics and compute (and play games after work)? Get FirePro W for workstations”, whereas Nvidia says “Want compute? Get a Titan if you want video output, or Tesla if you don’t”.

See this GeForce page, where they position it as a gamer’s card that competes with the Google Brain supercomputer and a Mac Pro. In other places (especially benchmarks) it is stressed that it is not meant for gamers, but for compute enthusiasts (who can afford it). See for example this review on Hardware.info:

That said, we wouldn’t recommend this product to gamers anyway: two Nvidia GeForce GTX 780 Ti or AMD Radeon R9 290X cards offer roughly similar performance for only a fraction of the money. Only two Titan-Zs in SLI offer significantly higher performance, but the required investment is incredibly high, to the point where we wouldn’t even consider these cards for our Ultimate PC Advice.

As a result, Nvidia stresses that these cards are primarily intended for GPGPU applications in workstations. However, when looking at these benchmarks, we again fail to see a convincing image that justifies the price of these cards.

So NVIDIA’s naming convention is unclear. If TITAN is for the serious and professional compute developer, why use the brand “GeForce”? A “Quadro Titan” would have made much more sense. Or even a “Tesla Workstation”, so developers could be guaranteed that their code would also run on the server.

Differentiating from low-end compute

Radeon and GeForce GPUs are used for low-cost compute clusters. Both AMD and NVidia prefer to sell their professional cards for that market, and struggle to make clear that game cards are not designed for compute-only solutions. The one thing they have done in recent years is to reserve good double-precision performance for their professional cards. An existing difference was the driver quality between Quadro/FirePro (industry quality) and GeForce/Radeon. I think both companies have to rethink this differentiated driver strategy, as compute has changed the demands of the market.

I expect more differences in the support software for different types of users. When would I pay for professional cards?

  1. Double Precision GFLOPS
  2. Hardware differences (ECC, NVIDIA GPUDirect or AMD SDI-link/DirectGMA, faster buses, etc)
  3. Faster support
  4. (Free) Developer Tools
  5. System Configuration Software (click-click and compute works)
  6. Ease of porting algorithms to servers/clusters (up-scaling with less bugs)
  7. Ease of porting algorithms to game-cards (simulation-mode for several game-cards)

So the list starts with hardware-specific demands, then shifts to developer support. Let me know in the comments why you would (or would not) pay for professional cards.

Evolving from gamer-compute to server-compute

GPU developers are not born, but made (trained or self-educated). Most of the time they start with OpenCL (or CUDA) on their own PC or laptop.

With Nvidia that would be hobby compute on GeForce, then serious work on Titan, then Tesla or Grid. AMD has a comparable growth path: hobby compute on Radeon, then an upgrade to FirePro W and then to FirePro S or Sky. With Intel it is Iris or XeonPhi directly, as their positioning is not clear at all when it comes to accelerators.

Conclusion

The positioning of graphics cards and compute cards is finally being settled at the high level, but will certainly change a few more times in the year(s) to come. Think of the growing market of home video editors in 2015, who will probably need a compute card for video compression. Nvidia will come up with a different solution than AMD or Intel, as it has no desktop CPU.

Do you think it will be possible to have an AMD APU with an NVIDIA accelerator? Will people in 2015 need to buy an accelerator box that attaches to their laptop or tablet via network or USB, to do the rendering and other compute-intensive work (a “private compute cloud”)? Or will there always be a market for discrete GPUs? Time will tell.

Thanks for reading. I hope the table makes clear how things stand as of 2014. Suggestions are welcome.

OpenCL vs CUDA Misconceptions


Translation available: Russian/Русский. (Let us know if you have translated this article too… And thank you!)


Last year I explained the main differences between CUDA and OpenCL. Now I want to dispel some old (and partly false) stories about CUDA vs OpenCL. While it has been claimed too often that one technology is simply better, it should also be said that CUDA is better in some aspects, whereas OpenCL is better in others.

Why did I write this article? I think NVIDIA is visionary in both technology and marketing. But as I’ve written before, the potential market for dedicated graphics cards is shrinking, and I therefore forecast the end of CUDA on the desktop. Not having this discussion opens the door to closed standards and delays the innovation that can happen on top of OpenCL. The sooner people and companies start choosing a standard that gives everyone equal competitive advantages, the more we can expect from the upcoming hardware.

Let’s stand by what we learnt at school about gathering information: don’t put all your eggs in one basket! Gather as many sources and references as possible. Please also read articles which claim (and substantiate!) why CUDA has a more promising future than OpenCL. If you can, post comments with links to articles you think others should read too. We appreciate contributions!

I also found that Google Insights agrees with what I constructed manually.

Continue reading “OpenCL vs CUDA Misconceptions”

Vivante GPU (Freescale i.MX6)

Vivante-Logo

Vivante got into the news with OpenCL when it won business in the automotive industry from NVIDIA. The reason: the car industry wanted the open standard OpenCL. We at StreamHPC were very glad with this recognition from the automotive industry.

See their GPGPU page for more info, from which the table below is taken.

 


Series   Clock (MHz)  Compute Cores  Shader Cores                 Shader GFLOPS                    GPGPU Options            Cache Coherent
GC800    600 – 800    1              1 (Vec-4) / 4 (Vec-1)        6–8 (High), 12–16 (Medium)       Embedded Profile         Yes
GC1000   600 – 800    1              2–4 (Vec-4) / 8–16 (Vec-1)   11–30 (High), 22–60 (Medium)     Embedded Profile         Yes
GC2000   600 – 800    1              4 (Vec-4) / 16 (Vec-1)       22–30 (High), 44–60 (Medium)     Embedded / Full Profile  Yes
GC4000   600 – 800    2              8 (Vec-4) / 32 (Vec-1)       44–60 (High), 88–120 (Medium)    Embedded / Full Profile  Yes
GC5000   600 – 800    4              8 (Vec-4) / 32 (Vec-1)       44–60 (High), 88–120 (Medium)    Embedded / Full Profile  Yes
GC6000   600 – 800    Up to 8        16 (Vec-4) / 64 (Vec-1)      88–118 (High), 176–236 (Medium)  Embedded / Full Profile  Yes

 

One big advantage Vivante claims to have over the competition is GFLOPS/mm². This could help them win the 1 TFLOPS war they have entered with their competitors. The upcoming GC4000 series can push around 48 GFLOPS, leaving the 1 TFLOPS mark to the GC6000 series.

Their GPUs are sold as IP to chip makers, so they don’t sell chips of their own. Vivante has created the GPU drivers, but you have to contact the chip maker to obtain them.

Freescale i.MX6

The i.MX6 Quad (4 ARM Cortex-A9 cores) and i.MX6 Dual (2 ARM Cortex-A9 cores) have support for OpenCL 1.1 EP (Embedded Profile). (source)

Both have a Vivante GC2000 GPU, which delivers 16 to 24 GFLOPS depending on the source. The GPU cores can be used to run OpenGL ES 2.0 shaders and OpenCL EP kernels.

Board: SABRE

There are several boards available. Freescale suggests using the SABRE Board for Smart Devices ($399).

The Linux board support package, including OpenCL drivers and general BSP documentation, is available for free download on the product page under the Software & Tools tab.


Other Boards

Alternative evaluation boards from third parties can be found by searching the internet for “i.MX6Q board”, as there are many! For instance the Wandboard (i.MX6Q, $139, shown on the left) – though we were tipped that the Dual is actually a DualLite and thus does not have support!

An OpenCL driver was found on the Wandboard Ubuntu image – download the clinfo output here (compiled with gcc clinfo.c -lOpenCL -lGAL).

Drivers & SDK

Under the Software & Tools tab of the SABRE board page there are drivers – they have not been tested with other boards, so no guarantees are given.

Most information is given in Get started with OpenCL on Freescale i.MX6.

  • IMX6_GPU_SDK: a collection of GPU code samples; for OpenCL the work is still in progress. You can find it under “Software Development Tools” -> “Snippets, Boot Code, Headers, Monitors, etc.”
  • IMX_6D_Q_VIVANTE_VDK_<version>_TOOLS: GPU profiling tools, an offline compiler and an emulator with CL support that runs on Windows platforms. Be sure you pick the latest version! You can find it under “Software Development Tools” -> “IDE – Debug, Compile and Build Tools”.


More info

Check out the i.MX-community. You can also contact us for more info.

To see a great application with 4 i.MX6 Quad boards using OpenCL, see this “Using i.MX6Q to Build a Palm-Sized Heterogeneous Mini-HPC” project.


Overview of OpenCL 2.0 hardware support, samples, blogs and drivers

We were too busy lately to tell you about it: OpenCL 2.0 is getting ready for prime time! Because it makes use of more recent hardware features, it is more powerful than OpenCL 1.x could ever be.

To get you up to speed, see this list of new OpenCL 2.0 features:

  • Shared Virtual Memory: host and device kernels can directly share complex, pointer-containing data structures such as trees and linked lists, providing significant programming flexibility and eliminating costly data transfers between host and devices.
  • Dynamic Parallelism: device kernels can enqueue kernels to the same device with no host interaction, enabling flexible work scheduling paradigms and avoiding the need to transfer execution control and data between the device and host, often significantly offloading host processor bottlenecks.
  • Generic Address Space: functions can be written without specifying a named address space for arguments, especially useful for those arguments that are declared to be a pointer to a type, eliminating the need for multiple functions to be written for each named address space used in an application.
  • Improved image support:  including sRGB images and 3D image writes, the ability for kernels to read from and write to the same image, and the creation of OpenCL images from a mip-mapped or a multi-sampled OpenGL texture for improved OpenGL interop.
  • C11 Atomics: a subset of C11 atomics and synchronization operations to enable assignments in one work-item to be visible to other work-items in a work-group, across work-groups executing on a device or for sharing data between the OpenCL device and host.
  • Pipes: memory objects that store data organized as a FIFO and OpenCL 2.0 provides built-in functions for kernels to read from or write to a pipe, providing straightforward programming of pipe data structures that can be highly optimized by OpenCL implementers.
  • Android Installable Client Driver Extension: Enables OpenCL implementations to be discovered and loaded as a shared object on Android systems.

I could write many articles about the above subjects, but will leave that for later. This article won’t go into the technical details, but rather into what’s available from the vendors. So let’s see what toys we have been given!

A note: don’t start with OpenCL 2.0 directly if you don’t know the basic concepts of OpenCL. Continue reading “Overview of OpenCL 2.0 hardware support, samples, blogs and drivers”

CPU Code modernisation – our hidden expertise

You’ve seen the speedups possible on GPUs. We secretly know that many of these techniques would also work on modern multi-core CPUs. If after the first optimisations the GPU still gets an 8x speedup, the GPU is the obvious choice. When it’s 2x, would the better choice be a bigger CPU or a bigger GPU? Currently the GPU is chosen more often.

Now that Intel and AMD have 28+ core CPUs, the answer to that question might lean towards the CPU. With a CPU that has 32 cores and 256-bit vector computations via AVX2, 32 double4s can be computed each clock cycle. A 16-core AVX1 CPU could work on 16 double2s, which is only a fourth of that performance. Actual performance compared to peak performance is comparable to GPUs here. Continue reading “CPU Code modernisation – our hidden expertise”