Phoronix OpenCL Benchmark 3.0 beta

So you want OpenCL-benchmarks? Phoronix is a benchmark for OSX and Linux, created by Michael Larabel, Matthew Tippett (http://en.wikipedia.org/wiki/Phoronix_Test_Suite). On Ubuntu Phoronix version 2.8 is in the Ubuntu “app store” (Synaptic), but 3.0 has those nice OpenCL-tests. The tests are based on David Bucciarelli‘s OpenCL demos. Starting to use Phonornix 3.0 (beta 1) is done in 4 easy steps:

  1. Download the latest beta-version from http://www.phoronix-test-suite.com/?k=downloads
  2. Extract. Can be anywehre. I chose /opt/phoronix-test-suite
  3. Install. Just type ./phoronix-test-suite in a terminal
  4. Use.

WARNING: It is beta-software and the following might not work on your machine! If you have problems with this tutorial and want or found a fix, post a reply.

Continue reading “Phoronix OpenCL Benchmark 3.0 beta”

Waiting for Mobile OpenCL – Q1 2011

About 5 months ago we started waiting for Mobile OpenCL. Meanwhile we had all the news around ARM on CES in January, and of course all those beta-programs made progress meanwhile. And after a year of having “support“, we actually want to see the words “SDK” and/or “driver“. So who’s leading? Ziilabs, ImTech, Vivante, Qualcomm, FreeScale or newcomer nVIDIA?

Mobile phone manufacturers could have a big problem with the low-level access to the GPU. While most software can be sandboxed in some form, OpenCL can crash the phone. But at the other side, if the program hasn’t taken down the developer’s test-phone, the chances are low it will take any other phone. And also there are more low-level access-points to the phone. So let’s check what has happened until now.

Note: this article will be updated if more news comes from MWC ’11.

OpenCL EP

For mobile devices Khronos has specified a profile, which is optimised for (ARM) phones: OpenCL Embedded Profile. Read on for the main differences (taken from a presentation by Nokia).

Main differences

  • Adapting code for embedded profile
  • Added macro __EMBEDDED_PROFILE__
  • CL_PLATFORM_PROFILE capabilityreturns the string EMBEDDED_PROFILE if only the embedded profile is supported
  • Online compiler is optional
  • No 64-bit integers
  • Reduced requirements for constant buffers, object allocation, constant argument count and local memory
  • Image & floating point support matches OpenGL ES 2.0 texturing
  • The extensions of full profile can be applied to embedded profile

Continue reading “Waiting for Mobile OpenCL – Q1 2011”

Products using OpenCL on ARM MALI are coming

mali-product-feature-CLSDK-940x300_vX1The past year you might not have heard much from OpenCL-on-ARM, besides the Arndale developer-board. You have heard just a small portion of what has been going on.

Yesterday the (Linux) OpenCL-drivers for the Chromebook (which contains an ARM MALI T604) the have been released and several companies will launch products using OpenCL.

Below are a few interviews with companies who have built such products. This will give an idea of what is possible on those low-power devices. To first get an idea of what this MALI T604 GPU can do if it comes to OpenCL, here a video from the 2013-edition of the LEAP-conference we co-organised.

http://www.youtube.com/watch?v=UQXfjvcqiQg

Understand that the whole board takes less than ~11.6 Watts – that is including the CPU, GPU, memory , interconnects, networking, SD-card, power-adapter, etc. Only a small portion of that is the GPU. I don’t know the exact specs as this developer-board was not targeted towards energy-optimisation goals. I do know this is less than the 225 Watts of a discrete GPU alone.

Interviews with ARM partners Continue reading “Products using OpenCL on ARM MALI are coming”

ARM

ARM is most known for their CPU architectures, but they also have a GPU architecture, MALI. Their devices support:

  • OpenCL
  • OpenGL
  • Vulkan
  • RenderScript

OpenCL

ARM takes OpenCL serious and has various developer boards and drivers for their MALI GPU. Most notable Samsung has their Exynos chips, but now also Rockchip brings high-end MALI GPUs.

Drivers and SDK

ARM MALI Linux SDKmali-product-feature-CLSDK-940x300_vX1

The SDK can be downloaded here. Developers manual is here. A 14-page FAQ with lots of answers on your questions is here.

For compilation on Ubuntu g++-arm-linux-gnueabi is needed. Also remove the “-none” in platform.mk. Then compilation will result in a libOpenCL.so.

mali-made

Drivers for Android

Software available for the Arndale can be found here – drivers (including graphics drivers with OpenCL-support) are here. The current state is that their OpenCL drivers are sometimes working, most times not – we are very sorry for that and try to find fixes.

We did not test the drivers with other devices than the Arndale with the same chipset (such as the new Chromebook and the Nexus 10).

Drivers for Linux

For the Samsung Chromebook drivers are available here. For Arndale these drivers should work too (not tested yet), if you use the kernel-drivers of the same version.

Firefly RK3288

Firefly-RK3288

The Firefly has:

RK3288 Cortex-A17 quad core@ 1.8GHz
Mali-T764 GPU with support for OpenGL ES 1.1/2.0 /3.0, OpenVG1.1, OpenCL, Directx11

Drivers have not yet been tested with this board yet! Cheap alternatives can be found here.

Exynos 5 Boards

The following boards are available:

  1. Arndale 5250-A
  2. YICsystem YSE5250
  3. YICsystem SMDKC520
  4. Nexus 10
  5. Samsung Chromebook

Scroll down for more info on these boards.

Board 1: Arndale 5250-A

The board is fully loaded and can be extended with touch-screen, SSD, Wifi+sound and camera. Below is an image with the soundboard and connectivity-board attached.

Working boards using OpenCL on display (click on image to see the Twitter-status):

Here are a few characteristics:

  • Cortex-A15 @ 1.7 GHz dual core
  • 128 bit SIMD NEON
  • 2GB 800MHz DDR3 (32 bit)

More information can be found on the wiki and forum.

Order information

For more information and to order, go to http://howchip.com/shop/item.php?it_id=AND5250A. For an overview of extensions, go to: http://howchip.com/shop/content.php?co_id=ArndaleBoard_en. The price is $250 for the board, $50 for shipping to Europe, and extension-boards start at $60. You need a VAT-number to get it through the customs, but you have to pay EU-VAT anyway.

Currently you need to order the LCD too, as the latest proprietary drivers (which includes OpenCL) does not work with HDMI. There are (vague) solutions, to be found on the forums.

Be sure you can buy a good 5V adapter (the + at the pin). A minimum of 3A is required for the board (TDP of the whole board is 11 to 12 Watt). Adapter costs around $25,- in the store, or you can buy them online for $7,50. You also need a serial cable – there are USB2COM-cables under €20,-. If you are in doubt, buy the $60,- package with cables (no COM2USB), adapter and microSD card.

Board 2: YICsystem YSE5250

This board has 2GB DDR3L(32bit 800MHz) and 8GB eMMC (onboard memory-card), USB 3.0 and LAN. Optional boards are for audio, WIFI, sensors (Gyroscope, Accelerometer, Magnetic, Light & Proximity), 5MB camera, LCD and GPS.

Currently it is unknown if OpenCL-drivers will be delivered, and there is no mention of it on their site.

SONY DSC

Below you’ll find the layout of the board.

yes5250_layout

Order information

You can order at http://www.yicsystem.com/products/low-cost-board/yse5250/. The complete board costs $245,-. Costs for shipping and import are unknown.

Board 3: YICsystem SMDKC520

The SMDKC520 board is the offical reference board of Samsung Exynos5250 System. Currently it is unknown if OpenCL-drivers will be delivered, but as chances are high I already put it here.

It is like the YSE5250, but it seems it includes WIFI, camera and LCD – though the webpage is very vague. Once I have more info on the YSE5250, I’ll continue on getting more infor on this board.

Price is unknown, but does not fall under “budget boards”.

smdkc520

Order information

You can send an enquiry at http://www.yicsystem.com/products/smdk-board/smdk-c520/. Remember that OpenCL-support is currently unknown!

Board 5: Google Nexus 10

OpenCL-drivers have been found pre-installed on this tablet, so with some tinkering you can run openCL rightaway.

It is a complete tablet, so no case-modding is needed. It has 2GB RAM, WIFI, 16 or 32GB eMMC, 5MP camera, 10″ WXGA LCD, all sensors, NFC, sound, etc. For all the specs, see this page.

n10-product-hero

Order information

You can order the Nexus 10 not in all countries, as google has restricted sales channels. See http://www.google.com/nexus/10/ for more info on ordering. With some creativity you can find ways to order this tablet into countries not selected by Google. Price is $400 or €400.

Board 6: Samsung Chromebook

chromebook

For €300 a complete laptop that runs Linux and has OpenGL ES and OpenCL 1.1 drivers? That makes it a great OpenCL “board”.

See ARM’s Chromebook dev-page for more information on how to get Linux running with OpenCL and OpenGL.

The drivers are brand new – when we’ve tested it, we’ll add more information on this page.

ARM MALI

[infobox type=”information”]

Need a MALI programmer? Hire us!

[/infobox]

ARM takes OpenCL serious and has various developer boards and drivers for their MALI GPU. Most notable Samsung has their Exynos chips, but now also Rockchip brings high-end MALI GPUs.

Drivers and SDK

ARM MALI Linux SDKmali-product-feature-CLSDK-940x300_vX1

The SDK can be downloaded here. Developers manual is here. A 14-page FAQ with lots of answers on your questions is here.

For compilation on Ubuntu g++-arm-linux-gnueabi is needed. Also remove the “-none” in platform.mk. Then compilation will result in a libOpenCL.so.

mali-made

Drivers for Android

Software available for the Arndale can be found here – drivers (including graphics drivers with OpenCL-support) are here. The current state is that their OpenCL drivers are sometimes working, most times not – we are very sorry for that and try to find fixes.

We did not test the drivers with other devices than the Arndale with the same chipset (such as the new Chromebook and the Nexus 10).

Drivers for Linux

For the Samsung Chromebook drivers are available here. For Arndale these drivers should work too (not tested yet), if you use the kernel-drivers of the same version.

 

Firefly RK3288

Firefly-RK3288

The Firefly has:

RK3288 Cortex-A17 quad core@ 1.8GHz
Mali-T764 GPU with support for OpenGL ES 1.1/2.0 /3.0, OpenVG1.1, OpenCL, Directx11

Drivers have not yet been tested with this board yet! Cheap alternatives can be found here.

Exynos 5 Boards

The following boards are available:

  1. Arndale 5250-A
  2. ARMBRIX Zero
  3. YICsystem YSE5250
  4. YICsystem SMDKC520
  5. Nexus 10
  6. Samsung Chromebook

Scroll down for more info on these boards.

Board 1: Arndale 5250-A

The board is fully loaded and can be extended with touch-screen, SSD, Wifi+sound and camera. Below is an image with the soundboard and connectivity-board attached.

Working boards using OpenCL on display (click on image to see the Twitter-status):

Here are a few characteristics:

  • Cortex-A15 @ 1.7 GHz dual core
  • 128 bit SIMD NEON
  • 2GB 800MHz DDR3 (32 bit)

More information can be found on the wiki and forum.

Order information

For more information and to order, go to http://howchip.com/shop/item.php?it_id=AND5250A. For an overview of extensions, go to: http://howchip.com/shop/content.php?co_id=ArndaleBoard_en. The price is $250 for the board, $50 for shipping to Europe, and extension-boards start at $60. You need a VAT-number to get it through the customs, but you have to pay EU-VAT anyway.

Currently you need to order the LCD too, as the latest proprietary drivers (which includes OpenCL) does not work with HDMI. There are (vague) solutions, to be found on the forums.

Be sure you can buy a good 5V adapter (the + at the pin). A minimum of 3A is required for the board (TDP of the whole board is 11 to 12 Watt). Adapter costs around $25,- in the store, or you can buy them online for $7,50. You also need a serial cable – there are USB2COM-cables under €20,-. If you are in doubt, buy the $60,- package with cables (no COM2USB), adapter and microSD card.

Board 2: ARMBRIX Zero  cancelled

This board has the same processor, but lacks the extension-boards and wifi/bluetooth. It does have 2GB 800MHz DDR3, various USB-ports, SATA(!) and Ethernet RJ45.

ARMBRIX

Order information

For more information and to order, go to http://www.howchip.com/shop/item.php?it_id=BRIX5250A. The price is $145 for the board, $50 for shipping to Europe. You need a VAT-number to get it through the customs.

Make sure you can buy a good adapter (the + at the pin). The minimum for the Arndale is 3A, but I am not sure this board draws less. Adapter costs around $25,- in the store, or you can buy them online for $7,50.

Board 3: YICsystem YSE5250

This board has 2GB DDR3L(32bit 800MHz) and 8GB eMMC (onboard memory-card), USB 3.0 and LAN. Optional boards are for audio, WIFI, sensors (Gyroscope, Accelerometer, Magnetic, Light & Proximity), 5MB camera, LCD and GPS.

Currently it is unknown if OpenCL-drivers will be delivered, and there is no mention of it on their site.

SONY DSC

Below you’ll find the layout of the board.

yes5250_layout

Order information

You can order at http://www.yicsystem.com/products/low-cost-board/yse5250/. The complete board costs $245,-. Costs for shipping and import are unknown.

Board 4: YICsystem SMDKC520

The SMDKC520 board is the offical reference board of Samsung Exynos5250 System. Currently it is unknown if OpenCL-drivers will be delivered, but as chances are high I already put it here.

It is like the YSE5250, but it seems it includes WIFI, camera and LCD – though the webpage is very vague. Once I have more info on the YSE5250, I’ll continue on getting more infor on this board.

Price is unknown, but does not fall under “budget boards”.

smdkc520

Order information

You can send an enquiry at http://www.yicsystem.com/products/smdk-board/smdk-c520/. Remember that OpenCL-support is currently unknown!

Board 5: Google Nexus 10

OpenCL-drivers have been found pre-installed on this tablet, so with some tinkering you can run openCL rightaway.

It is a complete tablet, so no case-modding is needed. It has 2GB RAM, WIFI, 16 or 32GB eMMC, 5MP camera, 10″ WXGA LCD, all sensors, NFC, sound, etc. For all the specs, see this page.

n10-product-hero

Order information

You can order the Nexus 10 not in all countries, as google has restricted sales channels. See http://www.google.com/nexus/10/ for more info on ordering. With some creativity you can find ways to order this tablet into countries not selected by Google. Price is $400 or €400.

Board 6: Samsung Chromebook

chromebook

For €300 a complete laptop that runs Linux and has OpenGL ES and OpenCL 1.1 drivers? That makes it a great OpenCL “board”.

See ARM’s Chromebook dev-page for more information on how to get Linux running with OpenCL and OpenGL.

The drivers are brand new – when we’ve tested it, we’ll add more information on this page.

Ask your question

Do you have a question? We are happy to answer all your questions on any subject discussed at this website.

Due to spam floods, we removed the form.

info@streamhpc.com

We try to answer your question within 24 hours.

Targetting various architectures in OpenCL and CUDA

“Everything that *is* makes up one single world; but not everything is alike in this world” – Plato

The question we aim to answer in this post is: “How to do you make software that performs on several platforms?”.

Note: This article is not fully finished – I’ll add more information during the coming months. It’s busy here!

Even in many Java-code you’ll find hard-coded filename-delimiters in the file-names, which then work on one OS only. Portability is a problem that exists in various aspects of programming. Let’s look at some of the main goals software can have, and which portability-problems they have.

  • Functionality. This is the minimum requirement. Once a function is decided, changing functionality takes a lot of time. Writing code that is very flexible in requirements is hard.
  • User-interface. This is what one sees and which is not too abstract to talk about. For example, porting software to a touch-device requires a lot of rethinking of interaction-principles.
  • API and library usage. To lower development-time, existing and known APIs and libraries are used. This can work out three ways: separation of concerns, less development-time and dependency. The first two being good architectural choices, the latter being a potential hazard. Changing the underlying APIs is not easy.
  • Data-types. Handling video is different from handling video-formats. If the files can be handles in the intermediate form used by the software, then adding new file-types is relatively easy.
  • OS and platform. Besides many visible specifics, an OS is also a collection of APIs. Not only corporate operating systems tend to think of their own platform only, but also competing standards. It compares a lot to what is described under APIs.
  • Hardware-performance. Optimizing software for a specific platform makes it harder to port to other platforms. This will the main point of this article.

OpenCL is known for not being performance-portable, but it is the best we currently have when it comes to writing code with performance as a primary target. The funny thing is that with CUDA 5.0 it has become clearer that NVIDIA has the problem in their GPGPU-language too, whereas it was used before to differentiate CUDA from OpenCL. Also, CUDA 5.0 has many new features only available on the latest Kepler-GPUs.

Continue reading “Targetting various architectures in OpenCL and CUDA”

Do your (X86) CPU and GPU support OpenCL?

Does your computer have OpenCL-capable hardware? Read on and find out if your computer is compatible…

If you want to know what other non-PC hardware (phones, tablets, FPGAs, DSPs, etc) is running OpenCL, see the OpenCL SDK page.

For people who only want to run OpenCL-software and have recent hardware, just read this paragraph. If you have recent drivers for your GPU, you can be sure OpenCL is already supported and you can run OpenCL-capable software. NVidia has support for OpenCL 1.1 since drivers 280.13, so if you need OpenCL 1.1, then make sure you have this version or later. If you want to use Intel-processors and you don’t have an AMD GPU installed, you need to download the runtime of Intel OpenCL.

If you want to know if your X86 device is supported, you’ll find answers in this article.

Often it is not clear how OpenCL works on CPUs. If you have a 8 core processor with double threading, then it mostly is understood that 16 pipelines of instructions are possible. OpenCL takes care of this threading, but also uses parallelism provided by SSE and AVX extension. I talked more about this here and here. Meaning that an 8-core processor with AVX can compute 8 times 32 bytes (8*8 floats or 8*4 doubles) in parallel. You could see it as parallelism of parallelism. SSE is designed with multimedia-operations in mind, but has enough to be used with OpenCL. The minimum requirement for OpenCL-on-a-CPU is SSE 4.2, though.

A question I see often is what to do if you have more devices. There is no OpenCL-package for all the available devices, so you then need to install drivers for each device. CPU-drivers are often included in the GPU-drivers.

Read on to find out exactly which processors are supported.

Continue reading “Do your (X86) CPU and GPU support OpenCL?”

Brochures

We have two brochures. One general and one for training. You can also generate PDFs from most pages on this website to support off-line discussion.

New versions will arrive soon – unfortunately there have been quite some delays.

[columns]
[one_half title=”Training”]

In the brochure for trainings (version January 2012) all training modules are written out.

Download our trainings brochure here.

[/one_half]
[one_half title=”General”]

In the general brochure (version May 2011) we explain both our consultancy and training services.

Download our general brochure here.

[/one_half]
[/columns]

StreamHPC.Brochure.2010-01.EN.pdf

AccelerEyes ArrayFire

There is a lot going on at the path to GPGPU 2.0 – the libraries on top of OpenCL and/or CUDA. Among many solutions we see for example Microsoft with C++ AMP on top of DirectCompute, NVidia (and more) with OpenACC, and now AccelerEyes (most known for their Matlab-extension Jacket and libJacket) with ArrayFire.

I want you to show how easy programming GPUs can be when using such libraries – know that for using all features such as complex numbers, multi-GPU and linear algebra functions, you need to buy the full version. Prices start at $2500,- for a workstation/server with 2 GPUs.

It comes in two flavours: for OpenCL (C++) and for CUDA (C, C++, Fortran). The code for both is the same, so you can easily switch – though you still see references to cuda.h you can compile most examples from the CUDA-version using the OpenCL-version with little editing. Let’s look a little into what it can do.

Continue reading “AccelerEyes ArrayFire”

NVIDIA’s answer to SandyBridge and Fusion

Intel has Sandy Bridge, AMD has Fusion, now NVIDIA has a combination of CPU and GPU too: Project Denver. The only difference is that it is not X86-based, but an ARM-architecture. And most-probable the most powerful ARM-GPU of 2011.

For years there were ARM-based Systems-on-a-chip: a CPU and a GPU combined (see list below). On the X86-platform the “integrated GPU” was on the motherboard, and since this year now both AMD/ATI and Intel hit this “new market”.The big advantage is that it’s cheaper to produce, is more powerful per Watt (in total) and has good acceleration-potential. NVIDIA does not have X86-chips and would have been the big loser of 2011; they did everything to reinvent themselves: 3D was reintroduced, CUDA was actively developed and pushed (free libraries and tools, university-programs, many books and trainings, Tesla, etc), a mobile Tegra graphics solution [1] (see image at the right),  and all existing products got extra backing from the marketing-department. A great time for researchers who needed to get free products in exchange of naming NVIDIA in their research-reports.

NVIDIA chose for ARM; interesting for who is watching the CUDA-vs-OpenCL battle, since CUDA was for GPUs of NVIDIA on X86 and ARM was solely for OpenCL. Period. In the contrary to their other ARM-based chips, this new chip probably won’t be in smartphones (yet); it targets systems that need more GPU-power like CUDA and games.

In a few days the article about Windows-on-ARM is to be released, which completes this article.

Continue reading “NVIDIA’s answer to SandyBridge and Fusion”

Updated: OpenCL and CUDA programming training – now online

Update: due to Corona, the Amsterdam training has been cancelled. We’ll offer the training online on dates that better suit the participants.

As it has been very busy here, we have not done public trainings for a long time. This year we’re going to train future GPU-developers again – online. For now it’s one date, but we’ll add more dates in this blog-post later on.

If you need to learn solid GPU programming, this is the training you should attend. The concepts can be applied to other GPU-languages too, which makes it a good investment for any probable future where GPUs exist.

This is a public training, which means there are attendees from various companies. If you prefer not to be in a public class, get in contact to learn more about our in-company trainings.

It includes:

  • Four days of training online
  • Free code-review after the training, to get feedback on what you created with the new knowledge;
  • 1 month of limited support, so you can avoid StackOverflow;
  • Certificate.

Trainings will be done by employees of Stream HPC, who all have a lot of experience with applying the techniques you are going to learn.

Schedule

Most trainings have around 40% lectures, 50% lab-sessions and 10% discussions.

Continue reading “Updated: OpenCL and CUDA programming training – now online”

Thalesians talk – OpenCL in financial computations

End of October I had a talk for the Thalesians, a group that organises different kind of talks for people working or interested in the financial market. If you live in London, I would certainly recommend you visit one of their talks. But from a personal perspective I had a difficult task: how to make a very diverse public happy? The talks I gave in the past were for a more homogeneous and known public, and now I did not know at all what the level of OpenCL-programming was of the attendants. I chose to give an overview and reserve time for questions.

After starting with some honest remarks about my understanding of the British accent and that I will kill my business for being honest with them, I spoke about 5 subjects. Some of them you might have read here, but not all. You can download the sheets [PDF] via this link: Vincent.Hindriksen.20101027-Thalesians. The below text is to make the sheets more clear, but certainly is not the complete talk. So if you have the feeling I skipped a lot of text, your feeling is right.

Continue reading “Thalesians talk – OpenCL in financial computations”

Selecting Applications Suitable for Porting to the GPU

Assessing software is never comparing apples to apples

The goal of this writing is to explain which applications are suitable to be ported to OpenCL and run on GPU (or multiple GPUs). It is done by showing the main differences between GPU and CPU, and by listing features and characteristics of problems and algorithms, which can make use of highly parallel architecture of GPU and simply run faster on graphic cards. Additionally, there is a list of issues that can decrease potential speed-up.

It does not try to be complete, but tries to focus on the most essential parts of assessing if code is a good candidate for porting to the GPU.

GPU vs CPU

The biggest difference between a GPU and a CPU is how they process tasks, due to different purposes. A CPU has a few (usually 4 or 8, but up to 32) ”fat” cores optimized for sequential serial processing like running an operating system, Microsoft Word, a web browser etc, while a GPU has a thousands of ”thin” cores designed to be very efficient when running hundreds of thousands of alike tasks simultaneously.

A CPU is very good at multi-tasking, whereas a GPU is very good at repetitive tasks. GPUs offer much more raw computational power compared to CPUs, but they would completely fail to run an operating system. Compare this to 4 motor cycles (CPU) of 1 truck (GPU) delivering goods – when the goods have to be delivered to customers throughout the city the motor cycles win, when all goods have to be delivered to a few supermarkets the truck wins.

Most problems need both processors to deliver the best value of system performance, price, and power. The GPU does the heavy lifting (truck brings goods to distribution centers) and the CPU does the flexible part of the job (motor cycles distributing doing deliveries).

Assessing software for GPU-porting fitness

Software that does not meet the performance requirement (time taken / time available), is always a potential candidate for being ported to a GPU. Continue reading “Selecting Applications Suitable for Porting to the GPU”

A list of Desktop GPU architectures

p3-architectureUPDATED in February 2017

Some optimisation tricks work really well on one architecture, and are useless on others. And even with better drivers, the older architectures need some help. In other words, it helps to know what architecture the GPU has. Therefore you get some help from your friends at StreamHPC.

Below you’ll find a list of the architecture names of all OpenCL-capable GPU models of Intel, NVIDA and AMD. It does not contain the professional lines for now – first we are focusing on getting the general models right.

Understand it took a lot of time to gather the below information, and normally we share such information only with our clients.

Continue reading “A list of Desktop GPU architectures”

Why use OpenCL on FPGAs?

9781118942208.pdfAltera has just released the free ebook FPGAs for dummies. One part of the book is devoted to OpenCL, so we’ll quote some extracts here  from one of the chapters. The rest of the book is worth a read, so if you want to check the rest of the text, just fill in the form on Altera’s webpage.

In StreamHPC we’re interested in OpenCL on FPGAs for one reason: many companies run their software on GPUs, when they should be using FPGAs instead; and at the same time, others stick to FPGAs and ignore GPUs completely. The main reason, we think, is that converting CUDA to VHDL, or Verilog to CPU intrinsics, is simply too painful. Another reason can be seen in the a amount of investment put on a certain technology. We believe that OpenCL can solve both of these issues. OpenCL is much more portable and can be converted to a new architecture in a relatively short time (if the developer is familiar with the project, the hardware and OpenCL). We have high familiarity with these two latter, which means we’re used to get new projects up-and-running.

Since both Altera and Xilinx have invested in OpenCL, the two FPGAs code has become more portable now. Altera has a public SDK (and they’re proudly loud about it), while Xilinx offers it in their latest tools (although they’re unfortunately much more silent about it).

Now, let us now go back to the quotes from the book that we wanted to share with you.

Andrew Moore describes OpenCL effectively in just a few sentences:

The need for heterogeneous computing is leading to new programming languages to exploit the new hardware. One example is the OpenCL first developed by Apple, Inc. OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, DSPs, FPGAs, and other types of processors. OpenCL includes a language for developing kernels (functions that execute on hardware devices) as well as application programming interfaces (APIs) that define and control the various platforms. OpenCL allows for parallel computing using task-based and data-based parallelism.

The author also shares some interesting insights around the reasons why OpenCL should be used on FPGA:

FPGAs are inherently parallel, so they’re a perfect fit with OpenCL’s parallel computing capabilities. FPGAs give you an alternative to the typical data or task parallelism by offering a pipeline parallelism where tasks can be spawned in a push-pull configuration with each task using different data from the previous task with or without host interaction. OpenCL allows you to develop your code in the familiar C programming language but using the additional capabilities provided by OpenCL. These kernels can be sent to the FPGAs without your having to learn the low-level HDL coding practices of FPGA designers. Generally, there are several benefits for software developers and system designers to use OpenCL to develop code for FPGAs:

  • Simplicity and ease of development: Most software developers are familiar with the C programming language, but not low-level HDL languages. OpenCL keeps you at a higher level of programming, making your system open to more software developers.
  • Code profiling: Using OpenCL, you can profile your code and determine the performance-sensitive pieces that could be hardware accelerated as kernels in an FPGA.
  • Performance: Performance per watt is the ultimate goal of system design. Using an FPGA, you’re balancing high performance in an energy-efficient solution.
  • Efficiency: The FPGA has a fine-grain parallelism architecture, and by using OpenCL you can generate only the logic you need to deliver one fifth of the power of the hardware alternatives.
  • Heterogeneous systems: With OpenCL, you can develop kernels that target FPGAs, CPUs, GPUs, and DSPs seamlessly to give you a truly heterogeneous system design.
  • Code reuse: The holy grail of software development is achieving code reuse. Code reuse is often an elusive goal for software developers and system designers. OpenCL kernels allow for portable code that you can target for different families and generations of FPGAs from one project to the next, extending the life of your code.

Today, OpenCL is developed and maintained by the technology consortium Khronos Group. Most FPGA manufacturers provide Software Development Kits (SDKs) for OpenCL development on FPGAs.

You can continue here if you want to read of this ebook. And  of course, whenever you want to learn some more more, feel free to write to us, or follow this conversation on Twitter, which goes on through our special account: @OpenCLonFPGAs.

The current state of WebCL

Years ago Microsoft was in court as it claimed Internet Explorer could not be removed from Windows without breaking the system, while competitors claimed it could. Why was this so important? Because (as it seems) the browser would get more important than the OS and internet as important as electricity in the office and at home. I was therefore very happy to see the introduction of WebGL, the browser-plugin for OpenGL, as this would push web-interfaces as the default for user-interfaces. WebCL is a browser-plugin to run OpenCL-kernels. Meaning that more powerful hardware-devices are available to JavaScript. This post is work-in-progress as I try to find more resources! Seen stuff like this? Let me know.

Continue reading “The current state of WebCL”

PDFs of Monday 12 September

As it got more popular that I shared my readings, I decided to put them on my site. I focus on everything that uses vector-processing (GPUs, heterogeneous computing, CUDA, OpenCL, GPGPU, etc). Did I miss something or you have a story you want to share? Contact me or comment on this article. If you tell others about these projects you discovered here, I would appreciate you mention my website or my twitter @StreamHPC.

The research-papers have their authors mentions; the other links can be presentations or overviews of (mostly) products. I have read all, except the long PhD-theses (which are on my non-ad-hoc reading-list) – drop me any question you have.

Bullet Physics, Autodesk style. AMD and Autodesk on integrating Bullet Physics engine into Maya.

MERCUDA: Real-time GPU-based marine scene simulation. OpenCL has enabled more realistic sea and sky simulation for this product, see page 7.

J.P.Morgan: Using Graphic Processing Units (GPUs) in Pricing and Risk. Two pages describing OpenCL/CUDA can give 10 to 100 times speedup over conventional methods.

Parallelization of the Generalized Hough Transform on GPU (Juan Gómez-Luna1a, José María González-Linaresb, José Ignacio Benavidesa, Emilio L. Zapatab and Nicolás Guil). Describing two parallel methods for the Fast Generalized Hough Transform (Fast GHT) using GPUs, implemented in CUDA. It studies how load balancing and occupancy impact the performance of an application on a GPU. Interesting article as it shows that you can choose in which limits you bump into.

Performance Characterization and Optimization of Atomic Operations on AMD GPUs (Marwa Elteir, Heshan Lin and Wu-chun Feng). Measurement of the impact of using atomic operations on AMD GPUs. It seems that even mentioning ‘atomic’ puts the kernel in atomic mode and has major influence on the performance. They also come up with a solution: software-based atomic operation. Work in progress.

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing (Mayank Daga, Ashwin M. Aji, and Wu-chun Feng). Another one from Virginia Tech, this time on AMD’s APUs. This article measures its performance via a set of micro-benchmarks (e.g., PCIe data transfer), kernel benchmarks (e.g., reduction), and actual applications (e.g., molecular dynamics). Very interesting to see in which cases discrete GPUs have a disadvantage even with more muscle power.

A New Approach to rCUDA (José Duato, Antonio J. Peña, Federico Silla1, Juan C. Fernández, Rafael Mayo, and Enrique S. Quintana-Ort). On (remote) execution of CUDA-software within VMs. Interesting if you want powerful machines in your company to delegate heavy work to, or are interested in clouds.

Parallel Smoothers for Matrix-based Multigrid Methods on Unstructured Meshes Using Multicore CPUs and GPUs (Vincent Heuveline, Dimitar Lukarski, Nico Trost and Jan-Philipp Weiss). Different methods around 8 multi-colored Gauß-Seidel type smoothers using OpenMP and GPUs. Also some words on scalability!

Visualization assisted by parallel processing (B. Lange, H. Rey, X. Vasques, W. Puech and N. Rodriguez). How to use GPGPU for visualising big data. An important factor of real-time data-processing is that people get more insight in the matter. As an example they use temperatures in a server-room. As I see more often now, they benchmark CPU, GPU and hybrids.

A New Tool for Classification of Satellite Images Available from Google Maps: Efficient Implementation in Graphics Processing Units (Sergio Bernabéa and Antonio Plaza).  30 times speed-up with a new parallel implementation of the k-means unsupervised clustering algorithm in CUDA. Ity is used for classification of satellite images.

TAU performance System. Product-presentation of TAU which does, among other things, parallel profiling and tracing. Support for CUDA and OpenCL. Extensive collection of tools, so worth to spend time on. An paper released in March describes TAU and compares it with two other performance measurement systems: PAPI and VamirTrace.

An Experimental Approach to Performance Measurement of Heterogeneous Parallel Applications using CUDA (Allen D. Malony, Scott Biersdorff, Wyatt Spear and Shangkar Mayanglambam). Using a TAU-based (see above) tool TAUcuda this paper describes where to focus on when optimising heterogeneous systems.

Speeding up the MATLAB complex networks package using graphic processors (Zhang Bai-Da, Wu Jun-Jie, Tang Yu-Hua and Li Xin). Free registration required. Their conclusion: “In a word, the combination of GPU hardware and MATLAB software with Jacket Toolbox enables high-performance solutions in normal server”. Another PDF I found was: Parallel High Performance Computing with emphasis on Jacket based computing.

Profile-driven Parallelisation of Sequential Programs (Georgios Tournavitis). PhD-thesis on a new approach for extracting and exploiting multiple forms of coarse-grain parallelism from sequential applications written in C.

OpenCL, Heterogeneous Computing, and the CPU. Presentation by Tim Mattson of Intel on how to use OpenCL with the vector-extensions of Intel-processors.

MMU Simulation in Hardware Simulator Based-on State Transition Models (Zhang Xiuping, Yang Guowu and Zheng Desheng). It seems a bit off-chart to have a paper on the Memory Management Unit of a ARM, but as the ARM-processor gets more important some insights on its memory-system is important.

Multi-Cluster Performance Impact on the Multiple-Job Co-Allocation Scheduling (Héctor Blanco, Eloi Gabaldón, Fernando Guirado and Josep Lluí Lérida). This research-group has developed a scheduling-technique, and in this paper they discuss in which situations theirs works better than existing techniques.

Convey Computers: Putting Personality Into High Performance Computing. Product-presentation. They combine X86-CPUs with pre-programmed FPGAs to get high though-put. In short: if you make heavy usage of the provided algorithms, then this might be an alternative to GPGPU.

High-Performance and High-Throughput Computing. What it means for you and your research. Presentation by Philip Chan of Monash University. Though the target-group is their own university, it gives nice insights on how it goes around on other universities and research-groups. HPC is getting cheaper and accepted in more and more types of research.

Bull: Porting seismic software to the GPU. Presentation for oil-companies on finding new oil-fields. These seismic calculations are quite computation-intensive and therefore portable HPC is needed. Know StreamHPC is also assisting in porting such code to GPUs.

Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems (Shuai Che, Jeremy W. Sheaffer and Kevin Skadron). This piece of software allows CUDA-programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms.

Real-time volumetric shadows for dynamic rendering (MsC-thesis of Alexandru Teodor V.L. Voicu). Self-shadowing using the Opacity Shadow Maps algorithm is not fit for real-time processing. This thesis discusses Bounding Opacity Maps, a novel method to overcome this problem. Including code at the end, which you can download here.

Accelerating Foreign-Key Joins using Asymmetric Memory Channels (Holger Pirk, Stefan Manegold and Martin Kersten). Shows how to accelerate Foreign-Key Joins by executing the random table lookups on the GPU’s VRAM while sequentially streaming the Foreign-Key-Index through the PCI-E Bus. Very interesting on how to make clever usage of I/O-bounds.

Come back next Monday for more interesting research papers and product presentations. If you have questions, don’t hesitate to contact StreamHPC.

Qt Creator OpenCL Syntax Highlighting

With highlighting for Gedit, I was happy to give you the convenience of a nice editor to work on OpenCL-files. But it seems that one of the most popular IDEs for C++-programming is Qt Creator. So you receive another free syntax highlighter. You need at least Qt Creator 2.1.0.

The people of Qt have written everything you need to know about their Syntax highlighting, which was enough help to create this file. You see that they use the system of Kate, so logically this file works with this editor too.

In this article there is all you need to know to use Qt Creator with OpenCL.

Installing

First download the file to your computer.

Under Windows and OSX you need to copy this file to the directory shareqtcreatorgeneric-highlighter in the Qt installation dir (i.e. c:Qtqtcreator-2.2.1shareqtcreatorgeneric-highlighter). Under Linux copy this file to ~/.kde/share/apps/katepart/syntax or to /usr/share/kde4/apps/katepart/syntax (all users). That’s all, have fun!

Exposing OpenCL on Android: Q&A with Tim Lewis of ZiiLabs

ZiiLabs has been offering an early access program for OpenCL SDK since last year. This program was very selective in choosing developers and little news has been put on their webpage. Now they are planning to make their Android NDK a standard component, it’s a good time to ask them some questions. GPGPU-consultant Liad Weinberger of Appilo also added a few questions.

The Q&A has been with Tim Lewis, director Marketing and Partner Relations of ZiiLabs, who has taken the time to give some insights in what we can expect around accelerated computations on Android. ZiiLabs has been better known as 3DLabs and has reinvented itself in 2009 (you can read the full history here). Like other companies in the ARM-industry they mostly design chips and let other parties manufacture devices using their schematics, drivers and software. Now to the questions.

Continue reading “Exposing OpenCL on Android: Q&A with Tim Lewis of ZiiLabs”

Get ready for conversions of large-scale CUDA software to AMD hardware

IMG_20160829_172857_croppedIn the past years we have been translating several types of software to AMD, targeting OpenCL (and HSA). The main problem was that manual porting limits the size of the to-be-ported code-base.

Luckily there is a new tool in town. AMD now offers HIP, which converts over 95% of CUDA, such that it works on both AMD and NVIDIA hardware. That 5% is solving ambiguity problems that one gets when CUDA is used on non-NVIDIA GPUs. Once the CUDA-code has been translated successfully, software can run on both NVIDIA and AMD hardware without problems.

The target group of HIP are companies with older clusters, who don’t want to pay the premium prices for NVIDIA’s latest offerings. Replacing a single server with 4 Tesla K20 GPUs of 3.5 TFLOPS by 3 dual-GPU FirePro S9300X2 GPUs of 11 TFLOPS will give a huge performance boost for a competitive price.

The costs of making CUDA work on AMD hardware is easily paid for by the price difference, when upgrading a GPU-cluster.

Continue reading “Get ready for conversions of large-scale CUDA software to AMD hardware”