Big announcements: SYCL 1.2, WebCL 1.0 and OpenCL 2.0

Khronos just announced three OpenCL-based releases:

  • SYCL 1.2 Provisional Spec – Abstraction Layer for Leveraging C++ and OpenCL
  • WebCL 1.0 Final Spec – JavaScript bindings to OpenCL
  • OpenCL 2.0 Adopters Program – Conformance for OpenCL 2.0 implementations

Below I’ve quoted the summaries. For each of these I’ve prepared articles, but due to lack of time I haven’t been able to finish and publish them yet. So for now, some remarks follow each summary.

Khronos Releases SYCL 1.2 Provisional Specification

Programming abstraction layer to enable applications and high-level frameworks to leverage C++ and OpenCL for heterogeneous parallel acceleration

March 19, 2014 – San Francisco, Game Developer’s Conference – The Khronos™ Group today announced the release of SYCL™ 1.2 as a provisional specification to enable community feedback.  SYCL is a royalty-free, cross-platform abstraction layer that enables the development of applications and frameworks that build on the underlying concepts, portability and efficiency of OpenCL™, while adding the ease-of-use and flexibility of C++.  For example, SYCL can provide single source development where C++ template functions can contain both host and device code to construct complex algorithms that use OpenCL acceleration – and then enable re-use of those templates throughout the source code of an application to operate on different types of data.

https://www.khronos.org/news/press/khronos-releases-sycl-1.2-provisional-specification

Higher-level languages are very important, as OpenCL is simply too low-level. SYCL is another effort to research and improve this area, as we haven’t found the holy grail yet. Languages like C++ AMP and RenderScript claim they can replace OpenCL, but we all know that some implementations of those languages have been built on top of OpenCL.
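To give an idea of the single-source model, here is a minimal C++ sketch based on the provisional SYCL 1.2 naming – exact API details may differ in the final specification:

#include <CL/sycl.hpp>
#include <vector>

// Scale a vector on an OpenCL device; the kernel is the C++ lambda below.
void scale(std::vector<float>& data, float factor)
{
    cl::sycl::queue q; // selects a default OpenCL device
    { // buffer scope: results are copied back to 'data' on destruction
        cl::sycl::buffer<float, 1> buf(data.data(), cl::sycl::range<1>(data.size()));
        q.submit([&](cl::sycl::handler& cgh) {
            auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
            cgh.parallel_for<class scale_kernel>(
                cl::sycl::range<1>(data.size()),
                [=](cl::sycl::id<1> i) { acc[i] *= factor; }); // device code
        });
    }
}

The same host code and the same lambda compile together; with templates the pattern can be reused for different data types, which is exactly the re-use the press release describes.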

Khronos Releases WebCL 1.0 Specification

JavaScript bindings to OpenCL brings heterogeneous parallel computing to Web browsers

March 19, 2014 – San Francisco, Game Developer’s Conference – The Khronos™ Group today announced the ratification and public release of the WebCL™ 1.0 specification.  Developed in close cooperation with the Web community, WebCL extends the capabilities of HTML5 browsers by enabling developers to offload computationally intensive processing to available computational resources such as multicore CPUs and GPUs.  WebCL defines JavaScript bindings to OpenCL™ APIs that enable Web applications to compile OpenCL C kernels and manage their parallel execution.  Like WebGL™, WebCL is expected to enable a rich ecosystem of JavaScript middleware that provides access to accelerated functionality to a wide diversity of Web developers.

https://www.khronos.org/news/press/khronos-releases-webcl-1.0-specification

WebCL is getting more and more attention, even before it was official. It would be interesting to see the same growth towards higher-level languages as we now have with OpenCL. For this reason we started the Learning WebCL website, to help you learn WebCL in the future.

Khronos Launches OpenCL 2.0 Adopters Program

Conformance tests now available to certify OpenCL 2.0 implementations

March 19, 2014 – San Francisco, Game Developer’s Conference – The Khronos™ Group today announced the availability of the official conformance test suite for the OpenCL 2.0 specification, making it possible for implementers to certify that their implementations are officially conformant through the Khronos OpenCL Adopters Program.  Khronos has also released a set of header files for OpenCL 2.0 and an updated specification with a number of clarifications and corrections to the specification first released in November 2013.

https://www.khronos.org/news/press/khronos-launches-opencl-2.0-adopters-program

Finally the headers are open. Stay tuned for an extensive OpenCL 1.2 vs OpenCL 2.0 comparison, which I have prepared but was unable to finish without the header files.

I hope you are as happy with these announcements as I am. This tells me that OpenCL is ready for real business.

Privacy Policy

Who we are

We are a group of companies based in the Netherlands, Hungary and Spain. We help our customers make their code run fast by optimizing the computations and using accelerators. We have been doing this since 2010.

Comments

When visitors leave comments on the site we collect the data shown in the comments form, and also the visitor’s IP address and browser user agent string to help spam detection.

An anonymised string created from your email address (also called a hash) may be provided to the Gravatar service to see if you are using it. The Gravatar service Privacy Policy is available here: https://automattic.com/privacy/. After approval of your comment, your profile picture is visible to the public in the context of your comment.

Forms

Form data is sent to self-hosted software and is not read by any third party.

Tracking

We use anonymized tracking to find out:

  • Which pages are visited how often
  • Which subjects are popular
  • Which pages are clicked through
  • From which countries or states the visitors are

During a visit/session, you get a random ID.

Cookies

If you leave a comment on our site you may opt in to saving your name, email address and website in cookies. These are for your convenience so that you do not have to fill in your details again when you leave another comment. These cookies will last for one year.

Tracking cookies last for 24 hours.

Embedded content from other websites

Articles on this site may include embedded content (e.g. videos, images, articles, etc.). Embedded content from other websites behaves in the exact same way as if the visitor has visited the other website.

These websites may collect data about you, use cookies, embed additional third-party tracking, and monitor your interaction with that embedded content, including tracking your interaction with the embedded content if you have an account and are logged in to that website.

Who we share your data with

None of the data is shared with any third party. Marketing reports don’t contain any personal data.

How long we retain your data

If you leave a comment, the comment and its metadata are retained indefinitely. This is so we can recognize and approve any follow-up comments automatically instead of holding them in a moderation queue.

Anonymous tracking data is not thrown away, so that we can find trends over the years.

What rights you have over your data

If you have left comments, you can request to receive an exported file of the personal data we hold about you, including any data you have provided to us. You can also request that we erase any personal data we hold about you. This does not include any data we are obliged to keep for administrative, legal, or security purposes.

Where your data is sent

Visitor comments and forms are checked through automated spam detection services: ReCAPTCHA and Akismet.

Reporting problems

We are not in the business of monetizing user data, and believe in finding new customers through content.

As software and plugins change after updates, we are sometimes surprised that more is collected than we configured.

If anything is incorrect or not legal, please email to privacy@streamhpc.com. If you have generic questions, go to the contact page or email to info@streamhpc.com.

Starting with GROMACS and OpenCL

Now that GROMACS has been ported to OpenCL, we would like you to help us make it better. Why? It is very important that we get more projects ported to OpenCL, to build more critical mass. If we only used our spare resources, we could port one project per year. So the deal is that we do the heavy lifting and, with your help, get all the last issues covered. Understand that we did the port using our own resources, as everybody was waiting for others to take a big step forward.

The below steps will take no more than 30 minutes.

Getting the sources

All sources are available on Github (our working branch, based on GROMACS 5.0). If you want to help, check out via git on the command line, via Visual Studio (git support is included in 2013 and available via a plugin for 2010 and 2012), Eclipse or your preferred IDE. Alternatively, you can simply download the zip-file. Note there is also a wiki, where most of this text came from. Especially check the “known limitations“. To check out via git, use:

git clone git@github.com:StreamHPC/gromacs.git

Building

You need a fully working build environment (GCC, Visual Studio) and an OpenCL SDK installed. You also need FFTW. The GROMACS build can compile it for you, but it is also in the Linux repositories, or it can be downloaded here for Windows. Below are the steps for Linux, without your own FFTW installed (read on for more options and explanation):

mkdir build
cd build
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DCMAKE_BUILD_TYPE=Release

There are several other build options. You don’t need them, but they give an idea of what is possible (a combined example follows the list):

  • -DCMAKE_C_COMPILER=xxx equal to the name of the C99 compiler you wish to use (or the environment variable CC)
  • -DCMAKE_CXX_COMPILER=xxx equal to the name of the C++98 compiler you wish to use (or the environment variable CXX)
  • -DGMX_MPI=on to build using an MPI wrapper compiler. Needed for multi-GPU.
  • -DGMX_SIMD=xxx to specify the level of SIMD support of the node on which mdrun will run
  • -DGMX_BUILD_MDRUN_ONLY=on to build only the mdrun binary, e.g. for compute cluster back-end nodes
  • -DGMX_DOUBLE=on to run GROMACS in double precision (slower, and not normally useful)
  • -DCMAKE_PREFIX_PATH=xxx to add a non-standard location for CMake to search for libraries
  • -DCMAKE_INSTALL_PREFIX=xxx to install GROMACS to a non-standard location (default /usr/local/gromacs)
  • -DBUILD_SHARED_LIBS=off to turn off the building of shared libraries
  • -DGMX_FFT_LIBRARY=xxx to select whether to use fftw, mkl or fftpack libraries for FFT support
  • -DCMAKE_BUILD_TYPE=Debug to build GROMACS in debug mode
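
To give an idea of how these combine: a debug build with MPI support and a custom install location could look like this (the install path is just an example):

cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DGMX_USE_OPENCL=ON \
  -DGMX_MPI=on -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-opencl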

It’s very important you use the options GMX_GPU and GMX_USE_OPENCL.

If the OpenCL files cannot be found, you could try to specify them (and let us know, so we can fix this), for example:

cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DCMAKE_BUILD_TYPE=Release \
  -DOPENCL_INCLUDE_DIR=/usr/include/CL/ -DOPENCL_LIBRARY=/usr/lib/libOpenCL.so

Then run make and optionally check the installation (success currently not guaranteed). For make you can use the option “-j X” to launch X threads. Below is with 4 threads (4-core CPU):

make -j 4

If you only want to experiment, and not code, you can install it system-wide:

sudo make install
source /usr/local/gromacs/bin/GMXRC

In case you want to uninstall, that’s easy. Run this from the build-directory:

sudo make uninstall

Building on Windows, special settings and problem solving

See this article on the GROMACS website. In all cases, it is very important that you turn on GMX_GPU and GMX_USE_OPENCL. Also, the wiki of the GROMACS OpenCL project has lots of extra information. Be sure to check it if you want to do more than just the benchmarks below.

Run & Benchmark

Let’s torture GPUs! You need to do a few preparations first.

Preparations

GROMACS needs to know where to find the OpenCL kernels, on both Linux and Windows. Under Linux, type: export GMX_OCL_FILE_PATH=/path-to-gromacs/src/. On Windows, define the GMX_OCL_FILE_PATH environment variable and set its value to /path_to_gromacs/src/.

Important: if you plan to make changes to the kernels, you need to disable the caching in order to be sure you will be using the modified kernels: set GMX_OCL_NOGENCACHE and for NVIDIA also CUDA_CACHE_DISABLE:

export GMX_OCL_NOGENCACHE
export CUDA_CACHE_DISABLE

Simple benchmark, CPU-limited (d.poly-ch2)

Then download the archive “gmxbench-3.0.tar.gz” from ftp://ftp.gromacs.org/pub/benchmarks. Unpack it in the build/bin folder. If you have installed GROMACS machine-wide, you can pick any directory you want. You are now ready to run from /path-to-gromacs/build/bin/:

cd d.poly-ch2
../gmx grompp
../gmx mdrun

Now you have just run GROMACS and got results like:

Writing final coordinates.

               Core t (s)   Wall t (s)      (%)
       Time:      602.616      326.506    184.6
                 (ns/day)    (hour/ns)
Performance:        1.323       18.136

Get impressed by the GPU (adh_cubic_vsites)

This experiment is called “NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water”. Download “ADH_bench_systems.tar.gz” from ftp://ftp.gromacs.org/pub/benchmarks. Unpack it in build/bin.

cd adh_cubic_vsites
../gmx grompp -f pme_verlet_vsites.mdp
../gmx mdrun

If you want to run on the first GPU only, add “-gpu_id 0” as a parameter of mdrun, as shown below. This is handy if you want to benchmark a specific GPU.
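For example:

../gmx mdrun -gpu_id 0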

What’s next to do?

If you have your own experiments, of course test them on your AMD devices. Let us know how they perform on “adh_cubic_vsites”! Understand that GROMACS was optimised for NVidia hardware, and we needed to reverse a lot of specific optimisations to get good performance on AMD.

We welcome you to solve or report an issue. We are now working on optimisations, which are the most interesting tasks of a porting job. All feedback and help is really appreciated. Do you have any question? Just ask them in the comments below, and we’ll help you on your way.

 

How to install OpenCL on Windows

Getting your Windows machine ready for OpenCL is rather straightforward. In short, you only need the latest drivers for your OpenCL device(s) and you’re ready to go. Of course, you will need to add an OpenCL SDK in case you want to develop OpenCL applications, but that’s equally easy.

Before we start, a few notes:

  • The steps described herein have been tested on Windows 8.1 only, but should also apply for Windows 7 and Windows 8.
  • We will not discuss how to write an actual OpenCL program or kernel, but focus on how to get everything installed and ready for OpenCL on a Windows machine. This is because writing efficient OpenCL kernels is almost entirely OS independent.

If you want to know more about OpenCL and you are looking for simple examples to get started, check the Tutorials section on this webpage.

Running an OpenCL application

If you only need to run an OpenCL application without getting into development stuff then most probably everything already works.

If OpenCL applications fail to launch, then you need to take a closer look at the drivers and hardware installed on your machine:

GPU Caps Viewer

  • Check that you have a device that supports OpenCL. All graphics cards and CPUs from 2011 and later support OpenCL. If your computer is from 2010 or before, check this page. You can also find a list with OpenCL conformant products on Khronos webpage.
  • Make sure your OpenCL device driver is up to date, especially if you’re not using the latest and greatest hardware. With certain older devices OpenCL support wasn’t initially included in the drivers.

Here is where you can download drivers manually:

  • Intel has hidden them a bit, but you can find them here with support for OpenCL 2.0.
  • AMD’s GPU-drivers include the OpenCL-drivers for CPUs, APUs and GPUs, version 2.0.
  • NVIDIA’s GPU-drivers mention mostly CUDA, but the drivers for OpenCL 1.1/1.2 are there too.

In addition, it is always a good idea to check for any other special requirements that the OpenCL application may have. Look for device type and OpenCL version in particular. For example, the application may run only on OpenCL CPUs, or conversely, on OpenCL GPUs. Or it may require a certain OpenCL version that your device does not support.

A great tool that will allow you to retrieve the details of the OpenCL devices in your system is GPU Caps Viewer, mentioned above.

Developing OpenCL applications

Now it’s time to put the pedal to the metal and start developing some proper OpenCL applications.

The basic steps would be the following:

  • Make sure you have a machine which supports OpenCL, as described above.
  • Get the OpenCL headers and libraries included in the OpenCL SDK from your favourite vendor.
  • Start writing OpenCL code. That’s the difficult part.
  • Tell the compiler where the OpenCL headers are located.
  • Tell the linker where to find the OpenCL .lib files.
  • Build the fabulous application.
  • Run and prepare to be awed in amazement.

Ok, so let’s have a look into each of these.

OpenCL SDKs

For OpenCL headers and libraries the main options you can choose from are:

  • the NVIDIA CUDA Toolkit
  • the AMD APP SDK
  • the Intel OpenCL SDK

As long as you pay attention to the OpenCL version and the OpenCL features supported by your device, you can use the OpenCL headers and libraries from any of these three vendors.

OpenCL headers

Let’s assume that we are developing a 64-bit C/C++ application using Visual Studio 2013. To begin with, we need to check how many OpenCL platforms are available in the system:


#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_int err;
    cl_uint numPlatforms;

    /* First call clGetPlatformIDs with no list, only to query the count. */
    err = clGetPlatformIDs(0, NULL, &numPlatforms);
    if (CL_SUCCESS == err)
        printf("\nDetected OpenCL platforms: %u", numPlatforms);
    else
        printf("\nError calling clGetPlatformIDs. Error code: %d", err);

    return 0;
}


We need to specify where the OpenCL headers are located by adding the path to the OpenCL “CL” folder to the Additional Include Directories. When using the NVIDIA SDK, “CL” is in the same location as the other CUDA include files, that is, CUDA_INC_PATH. On an x64 Windows 8.1 machine with CUDA 6.5 the environment variable CUDA_INC_PATH is defined as “C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include”.

If you’re using the AMD SDK, you need to replace “$(CUDA_INC_PATH)” with “$(AMDAPPSDKROOT)/include” or, for Intel SDK, with “$(INTELOCLSDKROOT)/include“.

(Screenshot: Additional Include Directories when using the NVIDIA SDK)

OpenCL libraries

Similarly, we need to let the linker know about the OpenCL libraries. Firstly, add OpenCL.lib to the list of Additional Dependencies:

(Screenshot: Additional Dependencies when using the NVIDIA SDK)

Secondly, specify the OpenCL.lib location in Additional Library Directories:

(Screenshot: Additional Library Directories when using the NVIDIA SDK)

As in the case of the includes, if you’re using the AMD SDK, replace “$(CUDA_LIB_PATH)” with “$(AMDAPPSDKROOT)/lib/x86_64”, or in the case of Intel with “$(INTELOCLSDKROOT)/lib/x64“.
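For reference, the same build can also be done from a command prompt. The line below is a sketch for the NVIDIA case, assuming CUDA_INC_PATH and CUDA_LIB_PATH are set as described above; adjust the variables for the AMD or Intel SDK:

cl main.c OpenCL.lib /I"%CUDA_INC_PATH%" /link /LIBPATH:"%CUDA_LIB_PATH%"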

And you’re good to go! The application should now build and run. Now, just how difficult was it? Happy OpenCL-coding on Windows!

If you have any question or suggestion, just leave a comment.

AMD ROCm 1.5 Linux driver-stack is out

ROCm is AMD’s open-source Linux driver stack that brings compute to HSA hardware. It does not provide graphics and therefore focuses on monitor-less applications like machine learning, math, media processing, machine vision, large-scale simulations and more.

For those who do not know HSA: the Heterogeneous System Architecture defines hardware and software such that different processor types (like CPU, GPU, DSP and FPGA) can seamlessly work together and have fine-grained memory sharing. Read more on HSA here.

About ROCm and its short history

The driver stack has been on Github for more than a year now. Development is done internally, while communication with users happens mostly via Github’s issue tracker. ROCm 1.0 was publicly announced on 25 April 2016. After version 1.0 there have now been six releases in only one year – the 4 months of waiting between 1.4 and 1.5 were therefore relatively long. You can certainly say development is progressing at a high pace.

ROCm 1.4 was released end of December and, besides a long list of fixed bugs, added a developer preview of OpenCL 2.0 kernel support. Support for OpenCL was limited to Fiji (R9 Fury series) and Baffin/Ellesmere (Radeon RX 400 series) GPUs, as these have the best HSA support of current GPU offerings.

Currently not all parts of the driver stack are open source, but the binary blobs will be open sourced eventually. You might wonder why a big corporation like AMD would open source such an important part of their offering. It makes total sense if you understand that their most important customers spend a lot of time on making the drivers and their own code work together. By giving access to the code, debugging becomes a lot easier, which reduces development time. This will result in fewer bugs and a shorter time-to-market for the AMD version of the software.

The OpenCL language runtime and compiler will be open sourced soon, so that AMD offers full OpenCL without any binary blob.

What does ROCm 1.5 bring?

Version 1.5 adds improved support for OpenCL, where 1.4 only gave a developer preview. Both feature support and performance have been improved. Just like in 1.4 there is support for OpenCL 2.0 kernels and OpenCL 1.2 host-code – the tool clinfo even mentions some support for 2.1 kernels, but we haven’t fully tested this yet.

The command-line based administration tool (ROCm-SMI) adds power monitoring, so power-efficiency can be measured.
The HCC compiler was upgraded to the latest CLANG/LLVM. There have also been big improvements in C++ compatibility.

Other improvements:

  1. Added new API hipHccModuleLaunchKernel, which works exactly like hipModuleLaunchKernel but takes the OpenCL programming model’s launch parameters, plus its test
  2. Added new API hipMemPtrGetInfo
  3. Added new field gcnArch to hipDeviceProp_t, which returns 803, 700, 900, etc.

Bug fixes:

  1. Fixed Copyright and header names
  2. Fixed issue with bit_extract sample
  3. Enabled lgamma and lgammaf
  4. Added guard for GFX8 specific intrinsics
  5. Fixed few issues with operator overloading of vector data types
  6. Fixed atanf
  7. Added guard for __half data types to work with clang version more than 3. (Will be removed eventually).
  8. Fixed 4_shfl to work only for gfx803, as Hawaii doesn’t support permute ops

Current hardware support:

  • GFX7: Radeon R9 290 4 GB, Radeon R9 290X 8 GB, Radeon R9 390 8 GB, Radeon R9 390X 8 GB, FirePro W9100 (16GB), FirePro S9150 (16 GB), and FirePro S9170 (32 GB).
  • GFX8: Radeon RX 480, Radeon RX 470, Radeon RX 460, Radeon R9 Nano, Radeon R9 Fury, Radeon R9 Fury X, Radeon Pro WX7100, Radeon Pro WX5100, Radeon Pro WX4100, and FirePro S9300 x2.

If you’re buying new hardware, pick a GPU from the GFX8 list. The FirePro S9300 x2 is currently the server-grade solution of choice.

Keep an eye on the Phoronix website, which is usually the first to benchmark AMD’s open source drivers.

Install ROCm 1.5

Where 1.4 had support for Ubuntu 14.04, Ubuntu 16.04 and Fedora 23, version 1.5 adds support for Fedora 24 and drops support for Ubuntu 14.04 and Fedora 23. On distributions other than Ubuntu 16.04 or Fedora 24 it *could* work, but there are zero guarantees.

Follow the instructions on Github step-by-step to get it installed via deb or rpm. Be sure to uninstall any previous release of ROCm to avoid problems.

The part on Grub might not be clear. For this release the magic GRUB_DEFAULT line on Ubuntu 16.04 is:

GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 4.9.0-kfd-compute-rocm-rel-1.5-76"

You need to alter this line with every update, else it’ll keep using the old version.

Make sure “/opt/rocm/bin/” is in your PATH when you want to do some coding.
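For example (assuming the default install location):

export PATH=$PATH:/opt/rocm/bin

When you then build and run the bundled HSA sample, you should get: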

/opt/rocm/hsa/sample$ sudo make
gcc -c -I/opt/rocm/include -o vector_copy.o vector_copy.c -std=c99
gcc -Wl,--unresolved-symbols=ignore-in-shared-libs vector_copy.o -L/opt/rocm/lib -lhsa-runtime64 -o vector_copy
/opt/rocm/hsa/sample$ ./vector_copy
Initializing the hsa runtime succeeded.
Checking finalizer 1.0 extension support succeeded.
Generating function table for finalizer succeeded.
Getting a gpu agent succeeded.
Querying the agent name succeeded.
The agent name is gfx803.
Querying the agent maximum queue size succeeded.
The maximum queue size is 131072.
Creating the queue succeeded.
"Obtaining machine model" succeeded.
"Getting agent profile" succeeded.
Create the program succeeded.
Adding the brig module to the program succeeded.
Query the agents isa succeeded.
Finalizing the program succeeded.
Destroying the program succeeded.
Create the executable succeeded.
Loading the code object succeeded.
Freeze the executable succeeded.
Extract the symbol from the executable succeeded.
Extracting the symbol from the executable succeeded.
Extracting the kernarg segment size from the executable succeeded.
Extracting the group segment size from the executable succeeded.
Extracting the private segment from the executable succeeded.
Creating a HSA signal succeeded.
Finding a fine grained memory region succeeded.
Allocating argument memory for input parameter succeeded.
Allocating argument memory for output parameter succeeded.
Finding a kernarg memory region succeeded.
Allocating kernel argument memory buffer succeeded.
Dispatching the kernel succeeded.
Passed validation.
Freeing kernel argument memory buffer succeeded.
Destroying the signal succeeded.
Destroying the executable succeeded.
Destroying the code object succeeded.
Destroying the queue succeeded.
Freeing in argument memory buffer succeeded.
Freeing out argument memory buffer succeeded.
Shutting down the runtime succeeded.

Also clinfo (installed from the default repo) should work.

Got it installed and tried your code? Did you see improvements? Share your experiences in the comments!

Not really ROCk music, but this blog post has been written while listening to the latest album of the Gorillaz.

How to introduce HPC in your enterprise

Spare time in IT – © jaymz.eu

For the past ten years we have been happy when we got back home from the office. Our home computer is simply faster, has more software and more memory, and does not take over 10 minutes to boot. Office computers can be that slow because 90% of the work is typing documents anyway. Meanwhile, the office servers are mostly used for the intranet and backups only. It’s the way of life and it seems we have to accept it.

But what if you have a daily batch that takes 1 hour to run and 10 people need to wait for the results to continue their tasks? What if you simply need a bigger server to serve your colleagues faster? Then Office-HPC can be the answer: the type of High Performance Computing that is affordable and within reach for most companies with more than 50 employees.

Below you’ll find out what you should do, in a nutshell.

Phase 0: Get familiar with parallel and GPU-computing, and convince your boss

This will take one or two weeks only, as it’s more about understanding the basics.

Understand what it’s all about and what’s important. We offer trainings, but you can also look around in the “knowledge base” in the menu above for lots of free advice. This is very important and should be done before anything else. Even if you end up with CUDA, learn the basics of OpenCL first. Why? Because after CUDA there is only one answer: using Nvidia hardware. Please delay this decision until later, so you don’t end up with the wrong solution.

How do you get your boss to invest in all this? I won’t lie about it: it’s a big investment. Luckily the return-on-investment is very good, even when only 10 people are using the software in the company. If the waiting period per person is reduced by 20 minutes per day, it’s easy to see that it pays back quickly: that’s 80 hours per person per year (20 minutes times roughly 240 working days). Based on 10 people, that is already €20K per year at a modest €25 per hour. StreamHPC has sped up software to take hours less time to process the daily data – therefore many of our clients could earn back the investment within a year, easily.

Phase 1: Know what device you want to use

Quite often I get customers who have bought an expensive Tesla, FirePro or XeonPhi and then ask me to speed up their software. I often get questions like “How do I speed up this algorithm on this device?”, while the question should be “How do I speed up this algorithm?”. It takes some time to find out which device fits the algorithm best.

There is too much to discuss in this phase, so I’ll keep it to a short Q&A. Please ask us for advice, as this phase is very important! We prefer helping people for free over reading about failed “HPC in the office” projects (which give others the idea that the technology is not ready yet).

Q: What programming language do I use?

Let’s start with the short answer. Is everything to be used within your office only, forever? Then use any language you want: CUDA, OpenCL or one of the many others. If you want the software to run on more devices, use OpenCL or OpenGL shaders. For example, when developing with several partners, you cannot stick to CUDA and should use OpenCL – else you force others to make certain investments. But if you have some domain-specific compute engine where you will only share the API in the cloud, you can use CUDA without problems.

Part of the long answer is that it is entangled with the algorithm you want to use. Please take good care of this, and make your decision based on good research – not based on what people have told you without discussing your code first.

Q: FPGAs? Why would I use those?

True, they’re more expensive, but they use much less power (20-30 Watt TDP). They’re famous for low-latency computations. If you already have OpenCL software, it ports quite easily to the FPGA – therefore I like the combination of AMD FirePro (good OpenCL support) and Altera Stratix V.

Xilinx recently also started to support OpenCL on their devices. They have the same reason as Altera: to make development time for FPGA code shorter.

Q: Why do CPUs still exist?

Because they perform pretty well on very irregular algorithms. The latest Xeon CPUs with 16 cores outperform GPUs when code-branch prediction is used heavily. And by using OpenCL you can get more performance than with OpenMP, plus you can port between devices much more easily.

Q: I heard I should not use gaming GPUs. Why not?

Professional accelerators come with support and tuned libraries, which explains part of the higher price. So even if gaming GPUs suffice, you need the support before you get to a cluster – the free support is mostly community-based and only gives answers to the problems everybody has. Also, libraries are often better tuned for professional cards. See it like this: gaming GPUs come with free games, professional compute GPUs come with free support and libraries.

Q: I can’t have passively cooled server-GPUs in my desktop. What now?

  • Intel: Go for the XeonPhis whose names end with an “A” (= actively cooled).
  • NVIDIA: For the newly announced K80 there will not be an actively cooled version – so take the actively cooled K40.
  • AMD: Instead of the S9150, get a W9100.
  • Altera: FPGAs are low-power, so you can use the same device. Do ask your supplier specifically if this applies to the FPGA you have in mind.

Phase 2: Have your office computer upgraded

As the goal is to see cluster-like performance, it’s better to have at least two accelerators in your computer. This is a big investment, but it’s also a good one: it’s the first step towards getting HPC in your office, so better do it well. Make sure the CPU has at least as much memory as your accelerators, if you want to use all the GPUs’ memory. The S9150 has 16GB of memory, so you need 32GB of RAM to support two cards.

If you make use of an external software development company, you also need to have a good machine to test out the software and to understand the code that will be rolled out in your company. Control and understanding of the code is very important when working with consultants!

In case you did not get through phase 1 completely, it is better to test with one accelerator first. Unless you need something like OpenGL/OpenCL interaction, make sure you use a third GPU for the video output, as display usage can influence GPU performance.

Program your software using MPI to connect the two accelerators, and be in full control of what is blocking, to be prepared for the cluster. A minimal sketch of this setup follows below.
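Here is a minimal sketch (untested, error checks omitted) of the one-MPI-rank-per-GPU pattern, assuming you start as many ranks per node as there are GPUs per node:

#include <mpi.h>
#include <CL/cl.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Enumerate the GPUs of the first OpenCL platform. */
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);
    cl_uint numDevices;
    cl_device_id devices[8];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, devices, &numDevices);

    /* Each rank claims one GPU on its node. */
    cl_device_id mine = devices[rank % numDevices];
    cl_context ctx = clCreateContext(NULL, 1, &mine, NULL, NULL, NULL);

    /* ... enqueue kernels here, then exchange results with non-blocking
       MPI_Isend/MPI_Irecv so you stay in control of what blocks ... */

    clReleaseContext(ctx);
    MPI_Finalize();
    return 0;
}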

Phase 3: Roll software out in a small group

At this phase it’s time to offer the service to a selected group. Say you have chosen to offer your compute solution via an Excel plugin, which communicates with the software via an API. Add new users one at a time – and make sure (parts of) the results are tested! From here on it’s software development as we know it, and the most unexpected bugs will come out of the test group.

If you get good results, your colleagues will have some accelerators by now too. If you did phases 0 and 1 well, you will probably get good results anyway. The moment you have set up the MPI environment on multiple desktops, you have just set up your minimal test environment. This is very important for later, as many enterprises lack a test environment – in that case it’s better to have one partially shared with your development environment. I’m pretty sure I’ll get comments on this, but I would really like more companies to do larger-scale tests before the production step.

Phase 4: Get a cluster (or cloud service)

If your algorithm is not CPU-bound, then it’s best to have as many GPUs per CPU as possible; else you need to keep it to one or two. We can already give you advice on this in phase 1, so you know what to prepare for. Then comes the most important step: calculate how much hardware you need to support the needs of your enterprise. It is possible that you only need one node with 8 GPUs to support even thousands of users.

Say the algorithm is not CPU-bound; then it’s best to put as many GPUs per node as possible. Personally I like ASUS servers most, as they are very open to all accelerators, unlike others who only offer accelerators from “selected partners”. At SC14 they introduced the ESC8000 E3, which holds 8 accelerators via PCIe3 x16 buses. There are more options available, but those systems don’t mention support for all vendors – my experience is that you get worse support if you do something special.

For Altera-only nodes you should look at completely different server cases, as the cooling requirements are different. For Xeon-only nodes you can find solutions with 4 CPU sockets.

If you are allowed to transport company-data outside the local network and can handle the data-transports over the internet, then a cloud-based service might also be a choice. Feel free to ask us what the options are nowadays.

You’re done

If the users are happy, then probably more software needs to be ported to the accelerators now. So good luck and have fun!

OpenCL alternatives for CUDA Linear Algebra Libraries

While CUDA has had the advantage of having many more libraries, this is no longer its main advantage when it comes to linear algebra. If one thing has changed over the past year, it is linalg library support for OpenCL. The choices have increased at a continuous rate, as you can see in the list below.

A general remark when using these libraries: you need to handle your data transfers and pick the correct data format with great care. If you don’t think this through, you won’t get the promised speed-up. Libraries below are free unless mentioned otherwise.

FFT

CUDA: The NVIDIA CUDA Fast Fourier Transform library (cuFFT) provides a simple interface for computing FFTs up to 10x faster. By using hundreds of processor cores inside NVIDIA GPUs, cuFFT delivers the…

OpenCL: clFFT is a software library containing FFT functions written in OpenCL. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and multicore programming.
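As a taste of clFFT’s plan-based API, here is a hedged sketch of an in-place 1D single-precision transform, assuming an existing context, queue and a buffer holding N interleaved complex floats (error checks omitted):

#include <clFFT.h>

void fft_inplace(cl_context ctx, cl_command_queue queue, cl_mem buf, size_t N)
{
    clfftSetupData setup;
    clfftInitSetupData(&setup);
    clfftSetup(&setup);

    clfftPlanHandle plan;
    clfftCreateDefaultPlan(&plan, ctx, CLFFT_1D, &N);
    clfftSetPlanPrecision(plan, CLFFT_SINGLE);
    clfftSetLayout(plan, CLFFT_COMPLEX_INTERLEAVED, CLFFT_COMPLEX_INTERLEAVED);
    clfftSetResultLocation(plan, CLFFT_INPLACE);
    clfftBakePlan(plan, 1, &queue, NULL, NULL); /* compiles the FFT kernels */

    clfftEnqueueTransform(plan, CLFFT_FORWARD, 1, &queue,
                          0, NULL, NULL, &buf, NULL, NULL);
    clFinish(queue);

    clfftDestroyPlan(&plan);
    clfftTeardown();
}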
Linear Algebra

CUDA: MAGMA is a collection of next-generation, GPU-accelerated linear algebra libraries, designed for heterogeneous GPU-based architectures. It supports interfaces to the current LAPACK and BLAS standards.

OpenCL: clMAGMA is an OpenCL port of MAGMA for AMD GPUs. The clMAGMA library dependencies, in particular optimized GPU OpenCL BLAS and CPU-optimized BLAS and LAPACK for AMD hardware, can be found in the AMD Accelerated Parallel Processing Math Libraries (APPML).
Sparse Linear Algebra

CUDA: CUSP is an open source C++ library of generic parallel algorithms for sparse linear algebra and graph computations on CUDA architecture GPUs. CUSP provides a flexible, high-level interface for manipulating sparse matrices and solving sparse linear systems.

OpenCL: clBLAS implements the complete set of BLAS level 1, 2 & 3 routines. Please see Netlib BLAS for the list of supported routines. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and multicore programming.

ViennaCL is a free open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs. The library is written in C++ and supports CUDA, OpenCL, and OpenMP. In addition to core functionality and many other features including BLAS level 1-3 support and iterative solvers, the latest release ViennaCL 1.5.0 provides many new convenience functions and support for integer vectors and matrices.

VexCL is a vector expression template library for OpenCL/CUDA. It has been created for ease of GPGPU development with C++. VexCL strives to reduce the amount of boilerplate code needed to develop GPGPU applications. The library provides convenient and intuitive notation for vector arithmetic, reduction, sparse matrix-vector products, etc. Multi-device and even multi-platform computations are supported.
Random number generation

CUDA: The NVIDIA CUDA Random Number Generation library (cuRAND) delivers high performance GPU-accelerated random number generation (RNG). The cuRAND library delivers high quality random numbers 8x…

OpenCL: The Random123 library is a collection of counter-based random number generators (CBRNGs) for CPUs (C and C++) and GPUs (CUDA and OpenCL). They are intended for use in statistical applications and Monte Carlo simulation and have passed all of the rigorous SmallCrush, Crush and BigCrush tests in the extensive TestU01 suite of statistical tests for random number generators. They are not suitable for use in cryptography or security even though they are constructed using principles drawn from cryptography.
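A quick sketch of the counter-based idea on the host side (the CUDA/OpenCL device headers work the same way); the seed value is illustrative:

#include <stdint.h>
#include <Random123/philox.h>

/* Counter-based RNG: the i-th block of randomness is a pure function of
   (counter, key), so threads need no shared RNG state. */
uint32_t random_for_index(uint32_t i)
{
    philox4x32_key_t key = {{12345u, 0u}};      /* 12345 is an arbitrary seed */
    philox4x32_ctr_t ctr = {{i, 0u, 0u, 0u}};   /* e.g. the work-item index */
    philox4x32_ctr_t r = philox4x32(ctr, key);  /* four 32-bit random values */
    return r.v[0];
}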
Math

CUDA: The CUDA Math library is an industry proven, highly accurate collection of standard mathematical functions. Available to any CUDA C or CUDA C++ application simply by adding “#include math.h” in…

OpenCL: It is worth looking into the details of what the CUDA math lib exactly contains; OpenCL has its built-in math functions defined in the specification.
AI

CUDA: A technology preview with CUDA-accelerated game tree search of both the pruning and backtracking styles. Games available: 3D Tic-Tac-Toe, Connect-4, Reversi, Sudoku and Go.

OpenCL: There are many tactics to speed up such algorithms. This CUDA library can therefore only be used for limited cases, but it is nevertheless a very interesting research area. Ask us for an OpenCL-based backtracking and pruning tree search, tailored to your problem.
Dense Linear Algebra

CUDA: CULA provides accelerated implementations of the LAPACK and BLAS libraries for dense linear algebra. It contains routines for systems solvers, singular value decompositions, and eigenproblems, and also provides various solvers. Free (with limitations) and commercial.

OpenCL: See ViennaCL, VexCL and clBLAS above. Kudos to the CULA team, as they were one of the first with a full GPU-accelerated linear algebra product.
Fortran

CUDA: The IMSL Fortran Numerical Library is a comprehensive set of mathematical and statistical functions that offloads CPU work to NVIDIA GPU hardware where the cuBLAS library is utilized. Free (with limitations) and commercial.

OpenCL: OpenCL-FORTRAN is not available yet. Contact us if you are interested and wish to work with a pre-release once available.
Multi-purpose libraries

CUDA: ArrayFire is a comprehensive GPU function library, including functions for math, signal processing, image processing, statistics, and more. Interfaces for C, C++, Fortran, and Python. Integrates with any CUDA program. Free (with limitations) and commercial.

OpenCL: ArrayFire 2.0 is also available for OpenCL. Note that currently fewer functions are supported in the OpenCL version than in the CUDA version, so please check the OpenCL documentation for the supported feature list. Free (with limitations) and commercial.
Image and signal processing

CUDA: The NVIDIA Performance Primitives library (NPP) is a collection of over 1900 image processing primitives and nearly 600 signal processing primitives that deliver 5x to 10x faster performance than…

OpenCL: Kudos to NVIDIA for bringing it all together in one place. OpenCL devs have to do some googling for specific algorithms.

So the gap between CUDA and OpenCL is certainly closing. CUDA provides a lot more convenience, so OpenCL-devs still have to keep reading blogs like this one to find what’s out there.

As usual, if you have additions to this list (free and commercial), please let me know in the comments below or by mail. I also have a few more additions to this list myself – depending on your feedback, I might represent the data differently.

InsideHPC: SuperComputing. Where to from here?

In this video, moderator Bob Feldman hosts a session entitled “Supercomputing: Where to from Here?”, recorded at the National HPCC Conference 2011 in Newport.

Panelists:
Dr. Eng Lim Goh, SGI
Bill Feiereisen, Intel
Shmuel Shottan, BlueArc
Steve Lyness, Appro International, Inc.
Marc Hamilton, HP Americas

http://www.youtube.com/watch?v=wI957eRr1kM

Below is a summary of what is said. These are just my notes, so go to the times mentioned to listen to the exact answers. Some details that you might consider important I did not write down (or missed, as English is not my mother tongue).

Continue reading “InsideHPC: SuperComputing. Where to from here?”

Building a 150 TFLOPS cluster with Accelerators in 2014

You can’t ignore accelerators anymore when designing a new cluster for HPC. Back in 2010 I suggested using GPUs to enter the Top 500 with a budget of only €38k. It takes ten times more now, as almost everybody has started to use accelerators. To get into the November Top 500 would roughly take a cluster of 150 TFLOPS.

I’d like to give you a list of what you can expect for 2014, to help you design your HPC cluster with recent hardware. The focus should be on OpenCL-capable hardware, as open standards prepare you better for future upgrades. So this is also a guess at what we will see in the November Top 500, based on current information.

There are currently professional solutions from NVIDIA, AMD, Intel and Altera. I’ve searched the web and asked around for what the upcoming offers would be. You will find the results below. But information should continue to flow: please add your remarks in the comments, so we get the best information through collaboration.

A remark on the comparison: only the double precision (DP) GFLOPS of the accelerators are mentioned. The theoretical GFLOPS cannot be reached in real-world benchmarks, therefore DGEMM is used as an indication of the maximum realistic GFLOPS. The efficiencies of other benchmarks (like Linpack) are all lower.

NVIDIA Tesla

NVIDIA Tesla is the current market leader with the Tesla K20 and K20X. By the end of 2013 they announced the K40 (GK110b architecture), which is 10% to 20% faster than the K20X (see table): 10% faster in max GFLOPS, plus up to another 10% due to architecture improvements. It’s not a huge difference, but the new Maxwell architecture is more promising. The problem is that high-end Maxwell is not expected this year. There are several rumours about what’s going on, but the official one is that there are problems with 20nm. I’ve had this confirmed by different sources and will, of course, keep you up-to-date on Twitter.

I could not find good enough information on the K40X. It has also been very quiet around the current architectures at their yearly GTC conference. My expectation is that they want to kick in hard with Maxwell in 2015, and that for 2014 they’ll focus on keeping their current customers happy in a different way. For now, let’s assume the K40X is 10% faster.

So, for this year it will be the K40. Here’s an overview:

  • Peak 1.43 DP TFLOPS theoretical
  • Peak 1.33 DP TFLOPS DGEMM (93% efficiency)
  • 5.65 GFLOPS/Watt DGEMM
  • Needs 122 GPUs to get 150 TFLOPS DGEMM
  • Lowest street price is $4800; $585,600 for 122 GPUs.

AMD FirePro

Just like the Tesla K40 and the Intel Xeon Phi, AMD offers accelerators with a lot of memory. The S10000 and S9000 are their current server offers, but these are still based on their older architectures. Their latest architecture is only available for gamers (i.e. R9 290X) and workstations (i.e. W9100). With the recent announcement of the W9100 we now have an indication of what the new server accelerator, the S9150, would cost and look like. I expect this card to launch soon – I even expected it to be launched before the W9100.

What is interesting about the W9100 is the high memory transfer rate and the large memory. Assuming they need to pack the S9150 into 225 Watt and won’t change the design much in order to launch soon, they need to under-clock it by about 22%. I think they could allow 235 Watt (like the K40), but I want to be realistic.

                  FirePro W9100      FirePro W9000      FirePro S9150
Shader count      2816               2048               2816
Memory size       16 GByte           6 GByte            16 GByte
Memory type       GDDR5              GDDR5              GDDR5
Interface         512 bit            384 bit            512 bit
Transfer rate     320 GByte/s        264 GByte/s        320 GByte/s
TDP               275 Watt           274 Watt           225 Watt (-22%)
Connectors        6 × MiniDP,        6 × MiniDP,        ?
                  3D-Stereo,         3D-Stereo,
                  Frame-/Genlock     Frame-/Genlock
Multi-monitor     yes (6)            yes (6)            don’t care
SP/DP (TFlops)    5.24 / 2.62        3.99 / 1.0         4.1 / 2.0 (-22%)
ECC               yes                yes                yes
OpenCL 2.0        yes                no                 yes
Price             $3999 USD          $2999 USD          ?

So, what about this new FirePro server card with the latest GCN, the S9150? An overview:

  • Peak 2.0 DP TFLOPS theoretical
  • Peak 1.6 DP TFLOPS DGEMM (at 80% efficiency, to be safe)
  • 7.1 GFLOPS/Watt DGEMM
  • Needs 94 GPUs to get 150 TFLOPS DGEMM
  • No prices available yet – AMD mostly prices lower than NVIDIA. At $3999 apiece, 94 GPUs would cost $375,906.

Update: a DGEMM efficiency of 90% has been reached. That gives 1.8 DP TFLOPS DGEMM and 8.3 GFLOPS/Watt DGEMM. As a result, you only need 84 GPUs to get to 150 TFLOPS.

Intel Xeon Phi

Intel currently offers the 3110, 5110 and 7110 Xeon Phis. In the past months they added the 3120, 5120 and 7120. The 7120 uses 300 Watt, which needs special casing to cool this passively cooled card – I don’t quite understand that choice. I could compare it to the W9100 and a heavily overclocked K40, or use lowered numbers like I did above with the FirePro. But, as you can see, it doesn’t compare well even at 300 Watt.

The OpenCL drivers have been improved this year, which is more promising news. The guess here is whether they will launch a new 7130, or a 7200, or none at all. All the news and rumours speak of 2015 and 2016, for more integrated memory and a socket version(!) of the Xeon Phi.

For this year the Xeon Phi 7120 is their top offer. It compares well with AMD’s W9100 when it comes to memory: 16GB GDDR5 and 352 GB/s.

  • Peak 1.21 DP TFLOPS theoretical
  • Peak 1.07 DP TFLOPS DGEMM (at 80% efficiency)
  • 3.56 GFLOPS/Watt DGEMM
  • Needs 140 Phi’s to get 150 TFLOPS DGEMM
  • Costs $4129 officially, $578,060 for 140.

Altera FPGAs

With OpenCL it finally became possible to run SIMD-focused software on FPGAs. OpenCL 2.0 also has some improvements for FPGAs, making them interesting for mature software that needs low latency or low power usage. In other words: software that has been designed on GPUs, where measurements show that lower latency would out-compete the GPU-based competition, or where the electricity bill makes the CFO sad. Understand that FPGAs do compete with the above three, but they have their own performance hot spots and are therefore hard to compare.

I don’t expect a big entry in this year’s Top 500, but I’m watching FPGA progress closely. Xilinx is also entering this market, but I don’t get much response (if any) to the emails I send them. For next year’s article I hope to include FPGAs as a true competitor. If you need low power or low latency, you’d better take your time this year to research the FPGA potential for your business.

Conclusion

Open standards

For those who don’t know, I tend to prefer open standards. The main reason is that switching hardware is easier; it gives you space to experiment. AMD, Intel and Altera support OpenCL 1.2 and will start later this year with 2.0, whereas NVIDIA lags more than 2 years behind and only supports OpenCL 1.1. The results are now very visible: due to the problems with Maxwell, you’ll need to postpone your plans to 2015 if you code in CUDA. There is one way to pressure them, though: port your code to OpenCL, buy Intel or AMD hardware, and then let NVidia know you want this flexibility.

Green 500

You might have noticed the big differences in GFLOPS/Watt. Where this matters is in the Green 500, the list of energy-efficient supercomputers. The goal of today’s supercomputers is to be mentioned in the top 10 of both lists. If you build an efficient cluster (say 2 CPUs + 4 GPUs), you can get to 70-80% of max DGEMM performance. Below is the list at 75%:

  • AMD FirePro – 7.10 GFLOPS/Watt DGEMM -> 5.33 GFLOPS/Watt @ 75%
  • NVIDIA Tesla – 5.65 GFLOPS/Watt DGEMM -> 4.24 GFLOPS/Watt @ 75%
  • Intel XeonPhi – 3.56 GFLOPS/Watt DGEMM -> 2.67 GFLOPS/Watt @ 75%

Currently this list is led by a cluster with K20X GPUs, streaming out 4.50 GFLOPS/Watt at even 86% of max DGEMM.

In other words: if the FirePro gets out in time, the Green 500 could be full of FirePro GPUs.

Update November 2014: here is the Green top 5.

Green500 with AMD FirePro S9150 at spot #1

The winner

Since there are only three offers, they are all winners. What matters is the order.

  1. AMD FirePro – 16GB with its fast memory, the clear winner in DGEMM performance. The negative side: CUDA software needs to be ported to OpenCL (we can do that for you).
  2. NVIDIA Tesla – Second to FirePro in everything (bandwidth, memory size, GFLOPS, price). The negative side: its OpenCL support is outdated.
  3. Intel XeonPhi – Same as FirePro when it comes to memory. Nevertheless, it’s 60% slower in DGEMM and 50% less power-efficient. The negative side: 300 Watt for a server.

I am happy to see AMD as a clear winner after years of NVIDIA leading the pack. As AMD is the most prominent supporter of OpenCL, this could seriously democratise HPC in times to come.


Need to port CUDA to extremely fast OpenCL? Hire us!

If you order a cluster from AMD instead of NVIDIA, you effectively get our services for free.


Apple’s dragging OpenCL compiler problem

Remember the times when the OpenCL compilers were not as good as they are now? Correct source code being rejected, typos being accepted, long compile times, crashes during compiling and other irritating bugs. These made the work of an OpenCL developer in “the old days” quite tiresome – you needed a lot of persistence and had to report many bugs. Luckily, on desktops the drivers have improved a lot.

Apple’s buggy OpenCL compiler

Now to Apple. There have always been complaints about the irritating bugs in Apple’s compiler. Recently the Luxrender community started to complain more loudly, as the guy responsible for the OSX port decided to quit. This was due to utter frustration: code that worked on every other OS simply did not work on OSX. Luxrender’s Paolo Ciccone stood up and made this extremely public by writing an open letter to Apple’s CEO Tim Cook (posted below).

The letter is not specific about the kind of bugs, so I asked him via Twitter which bugs he was talking about. He explained to me that it’s very simple:

https://twitter.com/RealityPaolo/status/595972568961519616

Here at StreamHPC we could write around those bugs in most cases, but Luxrender has bigger and more complex kernels than we used in our projects – there it’s simply impossible to write around, as the compiler simply crashes. It seems that OSX still has those old compilers that Linux and Windows used to have years ago.

Metal

Metal is the OpenCL-alternative on iOS 8 and up.

If you’re thinking that Metal could be a reason: that language looks very much like OpenCL, as it’s simply OpenCL as Apple would like it to be. Porting between the two languages is therefore quite simple, as the sketch below illustrates. This also means that, with some small fixes, a Metal kernel could be compiled by the existing OpenCL compiler. Ok, there is much more to Metal than the compute part, but the message is that more complex Metal kernels wouldn’t be possible on top of this driver stack.
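To show how close the two languages are, here is the same trivial kernel in both. The OpenCL version is standard; the Metal version is a hedged sketch from memory, so take the attribute syntax as an approximation:

// OpenCL C:
__kernel void scale(__global float* data, float factor)
{
    int i = get_global_id(0);
    data[i] *= factor;
}

// Metal shading language (sketch):
// kernel void scale(device float* data [[ buffer(0) ]],
//                   constant float& factor [[ buffer(1) ]],
//                   uint i [[ thread_position_in_grid ]])
// {
//     data[i] *= factor;
// }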

Only if we end up in a situation where Metal comes to OSX and is more stable than OpenCL, can we say that Apple is trying to block OpenCL in favour of their own APIs.

The letter

I’m really happy that Paolo Ciccone had the guts to publicly complain. This is the letter he wrote:

Dear Mr. Cook.

I’m sorry to bother you but we have tried all other channels and nothing worked.

I’m part of a group of developers of a physically-based renderer called LuxRender. LuxRender has been written to use OpenCL to accelerate its enormous amount of computation necessary to generate photo-realistic scenes. You can see some of the images generated by Lux at http://luxrender.net. Lux is an Open Source program.

Apple has defined OpenCL and we have adopted this API instead of the proprietary CUDA in order to be able to work with all kind of hardware on all major platforms. It made sense for an OSS to use an open standard.

The reason why I’m writing to you is that, after waiting for years, we still have broken GPU drivers on OS X. Scenes that render perfectly well on Windows and even on Linux simply abort on OS X. This is happening with both AMD and nVidia GPUs.

The problem is unsolvable from our side. We need updated, fixed drivers for OS X. The problem is so bad that our main OS X developer has announced, today, that he is giving up OS X. He simply can’t do his job.

I kindly request that you look into this and give us working AMD and nVidia drivers in an upcoming, possibly soon, update of OS X. We are more than willing to work with your engineers, if you need any kind of specific help in identifying the problem.

Thank you for your attention.

Paolo Ciccone

If you want to help, also post this letter on your blog or in a forum. The more this is shared, the better – especially on Apple’s forums, asking for an official statement.

Installing both NVidia GTX and AMD Radeon on Linux for OpenCL

August 2012: article has been completely rewritten and updated. For driver-specific issues, please refer to this article.

Want to have both your GTX and Radeon working as OpenCL devices under Linux? The bad news is that all attempts to get the Radeon as a compute device and the GTX as primary failed. The good news is that the other way around works pretty easily (with some luck). You need to install both drivers and watch out that libglx.so isn’t overwritten by NVidia’s driver, as we won’t use that GPU for graphics – this is also the reason why it is impossible to use the second GPU for OpenGL.

Continue reading “Installing both NVidia GTX and AMD Radeon on Linux for OpenCL”

OpenCL under Wine

The Wine 1.3 branch has had support for OpenCL 1.0 since 1.3.9. Since Microsoft likes to get a little part of the Linux-dominated HPC market, support for GPGPU is pretty good under the $799.00 Visual Studio – the free Express version is not supported well. So why not take the produced software back via Wine? The problem is that OpenCL is not in the current Wine binaries for some reason, but that is fixable while we wait for inclusion…

Lazy or short on time? You can try my binaries (Ubuntu 32, NVIDIA), but I cannot guarantee they work for you and you use them at your own risk: download (reported not working by some). See the second part of step 3 for what to do with them.

All the steps

I assume you have the OpenCL-SDK installed, but let me know if I need to add more details or clear up some steps.

1 – get the sources

The sources are available here. Be sure you download at least version 1.3.9. Alternatively, you can get the latest from git by going to a directory and executing:

git clone git://source.winehq.org/git/wine.git

A directory “wine” will be created. That was easy, so let’s go bake some binaries.

Continue reading “OpenCL under Wine”

Birthday present! Free 1-day Online GPGPU crash course: CUDA / HIP / OpenCL

Stream HPC turns 10 years old on 1 April 2020. Therefore we offer our one-day GPGPU crash course for free during that whole month.

Now that Corona (and the fear of it) is spreading, we had to rethink how to celebrate 10 years. So while there were different plans, we simply had to adapt to the market and world dynamics.

5 years ago…
Continue reading “Birthday present! Free 1-day Online GPGPU crash course: CUDA / HIP / OpenCL”

Q&A with Adrien Plagnol and Frédéric Langlade-Bellone on WebCL

WebCL is a great technique to have compute power in the browser. After WebGL, which brings high-end graphics to the browser, this is a logical step on the road towards the browser-only operating system (like Chrome OS, but more will follow).

Another way to look at technologies like WebCL is that they make it possible to lift the standard base from the OS to the browser. If you remember the antitrust trial over Microsoft’s integration of Internet Explorer, the focus was on the OS needing the browser to work well. Now it is the other way around – and it can be any OS, because the push doesn’t come from below but from above.

Last year two guys from Lyon, France got quite some attention, as they wrote a WebCL plugin. Their names: Adrien Plagnol and Frédéric Langlade-Bellone. Below you’ll find a Q&A with them on WebCL. Enjoy! Continue reading “Q&A with Adrien Plagnol and Frédéric Langlade-Bellone on WebCL”

How expensive is an operation on a CPU?

“Programmers know the value of everything and the cost of nothing.” I saw this quote a while back and loved it immediately. The quote by Alan Perlis is originally about LISP programmers, but only highly trained HPC programmers seem to have truly absorbed this basic knowledge. In an interview with Andrew Richards of Codeplay I heard it from another perspective: programming languages were not developed in a time when cache was 100 times faster than memory. He claimed that it should be exposed to the programmer what is expensive and what isn’t. I agreed again, and hence this post.

I think it is very clear that programming languages (and/or IDEs) need to be redesigned to cope with the hardware changes of the past 5 years. I talked about that in the articles “Separation of compute, control and transfer” and “Lots of loops“, but it does not seem to be enough.

So what are the costs of each operation (on CPUs)?

This article is just to help you on your way and, most of all, to make you aware. Note that it is incomplete and probably not valid for all kinds of CPUs.
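To give a first feeling for the gap, below is a micro-benchmark sketch of my own (not from the article): it compares a chain of dependent additions with a chain of dependent, cache-missing loads. Compile with something like gcc -O2 -std=c99 costs.c (add -lrt on older glibc); exact numbers vary per CPU, but expect roughly two orders of magnitude between the two.

/* costs.c - dependent adds versus dependent memory loads (illustrative) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M elements (128MB): far larger than any cache */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    /* Build a random permutation, so every load depends on the previous
       one (pointer chasing) and the prefetcher can't help. */
    size_t *next = malloc(N * sizeof *next);
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
        size_t j = (((size_t)rand() << 16) | (size_t)rand()) % (i + 1);
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    /* Dependent adds: stay in registers/L1, roughly a cycle each. */
    double t0 = seconds();
    volatile size_t acc = 0;               /* volatile keeps the loop alive */
    for (size_t i = 0; i < N; i++) acc += i;
    double t_add = seconds() - t0;

    /* Dependent loads: each one has to wait for main memory. */
    t0 = seconds();
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];
    double t_load = seconds() - t0;

    printf("dependent add : %6.2f ns/op\n", t_add / N * 1e9);
    printf("dependent load: %6.2f ns/op\n", t_load / N * 1e9);
    return (int)p;   /* use p, so the chase isn't optimised away */
}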

Continue reading “How expensive is an operation on a CPU?”

Imagination Technologies PowerVR


Need a PowerVR programmer? Hire us!

Currently there are two PowerVR GPU architectures with OpenCL support: the 5 series (scroll down) and the 6 series (introduced in 2014).

PowerVR 6

In 2013 companies will launch processors using IP from Imagination Technologies, the PowerVR G6230 and G6430. Named licensees are:

  • ST-Ericsson: NovaThor A9600 – not available yet, or even mentioned on their own webpage.
  • Texas Instruments: no products announced; the latest OMAP5 series are PowerVR 5 based.
  • Renesas Electronics: no products announced; the latest are based on Series5 (SGX54x and 53x).
  • MediaTek: no products announced; the latest MT6577 is based on Series5.
  • HiSilicon: licensed, but no products announced.

While a lot of news appeared around this platform, it has been delayed several times. Their latest designs in the series are running on an FPGA, and more details will be given at CES 2013 (source).

[Image: PowerVR Series6 architecture]

Performance

Below you see where the PowerVR 6 stands. It is clocked much higher (600MHz instead of 250MHz), which suggests it will be baked well below 45nm. The PowerVR 5 is used in, for example, the iPad 2 and delivers around 70GFLOPS. The PowerVR 6 G62x0 is promised to deliver 200GFLOPS and up, and the TFLOPS barrier is promised to be broken with this series.

[Image: GFLOPS comparison between the PowerVR 5 and 6 series]

 http://withimagination.imgtec.com/powervr/powervr-series6xe-gpus-bring-opengl-es-3-0-graphics-everyone

http://blog.imgtec.com/news/accelerate-design-closure-for-ip-cores-from-imagination-dok-design-flows

The image below shows that the 6 series is baked on 32nm and below. It shows different series identifiers though: the “3” in G6x30 indicates the addition of frame buffer compression logic (source).

[Image: PowerVR Series6 product overview]

PowerVR 5

The GPU architecture that currently dominates mobile devices, from the PSP to tablets, phones and the Apple iPad.

Drivers

Imagination only sells IP and refers to their licensees for driver support. If you have ideas what to do with OpenCL on PowerVR, you can request an NDA here.

Texas Instruments has the drivers available, but only under an SLA. Contact your TI-representative for the most recent information.

Samsung delivers drivers with their Exynos 5 Octa development board (Odroid XU).

Boards

Let me know if you know of a TI board and can give me a description of the business requirements for getting hold of the drivers.

Exynos 5410: ODROID-XU

  • ARM Cortex-A15 Quad 1.6GHz + Cortex-A7 Quad 1.2GHz
  • PowerVR 5 SGX544 MP3 GPU
  • 2GB LPDDR3 (12.8GB/s memory bandwidth)
  • Lots of IO-ports (see image below). No wifi without dongle.
  • CCI-400 bug seems to be fixed (source). Not clear how.

[Image: ODROID-XU board and IO-ports]

OpenCL info from the FAQ:

Which OpenGL and OpenCL are included in Android?
OpenGL ES 1.1 and OpenGL ES 2.0
OpenCL 1.1 Embedded Profile

Will it run Ubuntu or other Linux distros?
Currently we support only the Ubuntu 13.04 server version, with only a serial console. We need to develop an HDMI/LCD driver for Xorg display. We are trying to release a Linux BSP with OpenGL/OpenCL in Q4 of 2013.

Buy here. Forum here.
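To check for yourself what the driver reports, a minimal device query in the spirit of clinfo works against any OpenCL SDK (my own sketch; build with gcc devquery.c -lOpenCL):

/* devquery.c - print every OpenCL device with its reported version */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint nplat = 0;
    clGetPlatformIDs(8, platforms, &nplat);
    if (nplat > 8) nplat = 8;

    for (cl_uint p = 0; p < nplat; p++) {
        cl_device_id devices[8];
        cl_uint ndev = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &ndev);
        if (ndev > 8) ndev = 8;

        for (cl_uint d = 0; d < ndev; d++) {
            char name[256], version[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                            sizeof name, name, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_VERSION,
                            sizeof version, version, NULL);
            /* expect something like "OpenCL 1.1 EMBEDDED PROFILE ..." */
            printf("%s: %s\n", name, version);
        }
    }
    return 0;
}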

Drivers

See this page for an explanation of how to write the images. The links don’t get updated, so check the above links for the latest versions.

The minimum price for board + eMMC + shipping is $273. The price including the needed parts (eMMC, shipping), the possibly needed parts (HDMI, USB-UART) and the convenience add-ons (SD, wifi) is shown below:

[Image: ODROID-XU order overview with add-ons]

The board includes an adaptor and case. An eMMC is needed for a fast OS, but a micro-SD also works – although slower.

Without the integrated power analysis tool it’s $30 less. Ordering only the board costs $10 less in shipping.

Devices

Once I get more (public) info on OpenCL-drivers, this section will be extended.

Kindle Fire HD

As seen on Engadget, Imagination Technologies is working with Amazon and Texas Instruments to deliver OpenCL-enabled Kindle Fire HDs.

http://www.youtube.com/watch?v=-twOwM4LP9o

The chipset is an OMAP 4470 by Texas Instruments, which contains a PowerVR SGX544 GPU running at 384MHz. It delivers only 24.5GFLOPS.

ST-Ericsson NovaThor LP9600 (Nova A9600)

This chipset has a dual-core ARM Cortex-A15 at 2.3GHz (28nm), a PowerVR Series6 GPU and dual-channel LP-DDR2. It should become available in Q1 2013.

Since Ericsson will leave ST-Ericsson after a transition period, it is unclear if the chipset is delayed.

Imagination

Imagination is best known for their GPUs in Apple’s iDevices. They have support for:

  • OpenCL
  • Apple Metal
  • Vulkan
  • OpenGL
  • Google RenderScript

Imagination is a strong supporter of Khronos APIs OpenCL and Vulkan.


Support matrix of Compute SDKs

Multi-Core Processors and the SDKs

The empty boxes show that IBM and ARM still have a lot of influence. At NVIDIA’s current pace of introducing new products (hardware and CUDA), they could also take on ARM.

The matrix is restricted to the currently better-known compute technologies: OpenCL, CUDA, Intel ArBB, PathScale ENZO, MS DirectCompute and AccelerEyes JacketLib.

X = All OSes, including Mac
D = Developer (private alpha or private beta)
P = Planned (as stated in e.g. Intel’s Q&A)
U = Unofficial (IBM’s OpenCL-SDK is promoted for their POWER-line)
L = Linux-only
W = Windows-only
? = Unknown if planned

Continue reading “Support matrix of Compute SDKs”

Freescale / Vivante

Vivante got into the news with OpenCL when winning business in the automotive industry from NVIDIA. Reason: the car industry wanted the open standard OpenCL – a recognition we at StreamHPC were very glad to see. They have support for:

  • OpenCL
  • OpenGL
  • Google RenderScript

OpenCL

See Vivante’s GPGPU-page for more info; the overview below is taken from the table there.

  • GC800 Series: 600–800MHz, 1 compute core, 1 (Vec-4) / 4 (Vec-1) shader cores, 6–8 GFLOPS (high) / 12–16 (medium), Embedded Profile, cache coherent
  • GC1000 Series: 600–800MHz, 1 compute core, 2/4 (Vec-4) / 8/16 (Vec-1) shader cores, 11–30 GFLOPS (high) / 22–60 (medium), Embedded Profile, cache coherent
  • GC2000 Series: 600–800MHz, 1 compute core, 4 (Vec-4) / 16 (Vec-1) shader cores, 22–30 GFLOPS (high) / 44–60 (medium), Embedded/Full Profile, cache coherent
  • GC4000 Series: 600–800MHz, 2 compute cores, 8 (Vec-4) / 32 (Vec-1) shader cores, 44–60 GFLOPS (high) / 88–120 (medium), Embedded/Full Profile, cache coherent
  • GC5000 Series: 600–800MHz, 4 compute cores, 8 (Vec-4) / 32 (Vec-1) shader cores, 44–60 GFLOPS (high) / 88–120 (medium), Embedded/Full Profile, cache coherent
  • GC6000 Series: 600–800MHz, up to 8 compute cores, 16 (Vec-4) / 64 (Vec-1) shader cores, 88–118 GFLOPS (high) / 176–236 (medium), Embedded/Full Profile, cache coherent

One big advantage Vivante claims to have over the competition is GFLOPS/mm². This could help them win the 1TFLOPS-war with their competitors, which they have now entered. The upcoming GC4000 series can push around 48GFLOPS, leaving the 1TFLOPS to the GC6000 series.

Their GPUs are sold as IP to CPU makers, so they don’t sell chips themselves. Vivante creates the GPU drivers, but you have to contact the chip maker to obtain them.

Freescale i.MX6

The i.MX6 Quad (4 ARM Cortex-A9 cores) and i.MX6 Dual (2 ARM Cortex-A9 cores) have support for OpenCL 1.1 EP (Embedded Profile). (source)

Both have a Vivante GC2000 GPU, which delivers 16 to 24 GFLOPS depending on the source. The GPU cores can be used to run OpenGL ES 2.0 shaders and OpenCL EP kernels.
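What does an Embedded Profile friendly kernel look like? As a minimal illustration (my own example, not vendor code): EP makes 64-bit integer support optional and relaxes some floating-point requirements, so sticking to 32-bit float and int types is the safe choice.

// saxpy.cl - y = a*x + y, using only 32-bit types (safe on Embedded Profile)
__kernel void saxpy(__global const float *x,
                    __global float *y,
                    const float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}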

Board: SABRE

There are several boards available. Freescale suggests using the SABRE Board for Smart Devices ($399).

A Linux board support package, including OpenCL drivers and general BSP documentation, is available for free download on the product page under the Software & Tools tab.

Other Boards

Alternative evaluation boards from third parties can be found by searching the internet for “i.MX6Q board” – there are many! For instance the Wandboard (i.MX6Q, $139, shown at the left), though we were tipped that the Dual is actually a DualLite and thus does not have support.

An OpenCL driver was found on the Wandboard Ubuntu image – download the clinfo output here (built with gcc clinfo.c -lOpenCL -lGAL).

Drivers & SDK

Under the Software & Tools tab of the SABRE-board there are drivers – they have not been tested with other boards, so no guarantees are given.

Most information is given in Get started with OpenCL on the Freescale i.MX6.

  • IMX6_GPU_SDK: a collection of GPU code samples; for OpenCL the work is still in progress. You can find it under “Software Development Tools” -> “Snippets, Boot Code, Headers, Monitors, etc.”
  • IMX_6D_Q_VIVANTE_VDK_<version>_TOOLS: GPU profiling tools, an offline compiler and an emulator with CL support that runs on Windows platforms. Be sure you pick the latest version! You can find it under “Software Development Tools” -> “IDE – Debug, Compile and Build Tools”.

More info

Check out the i.MX-community. You can also contact us for more info.

To see a great application with 4 i.MX6 Quad boards using OpenCL, see this “Using i.MX6Q to Build a Palm-Sized Heterogeneous Mini-HPC” project.


How to speed up Excel in 6 steps

After the last post on Excel (“Accelerating an Excel Sheet with OpenCL“), there have been various requests and discussions about how we do “the miracle”. Short story: we only apply proper engineering tactics. Below I’ll explain how you can also speed up Excel, and when you actually have to call us (the last step).

A computer can handle tens of gigabytes per second. Now look at how big your Excel-sheet is and how much time it takes. Now you understand that the problem is probably not your computer.

Excel is a special piece of software from a developer’s perspective. An important rule of software engineering is to keep functionality (code) and data separate. Excel mixes these two like no other, which actually goes well in many cases – unless the data gets too big or the computations get too heavy. In that case you’ve reached Excel’s limits and need to solve the problem properly.

An Excel-file often does things one by one, with a new command in every cell. This prevents any kind of automatic optimization – besides that, it makes Excel-sheets very prone to errors.

Below are the steps to go through, most of which you can do yourself!

Continue reading “How to speed up Excel in 6 steps”