General articles on technical subjects.

Professional and Consumer Media Software using OpenCL

OpenCL_Logo

More and more professional media software now supports OpenCL. It is becoming a race in which you cannot afford to stay behind: if a competitor runs more than twice as fast on the same hardware, you just can’t say “Sorry, you should buy NVIDIA hardware”. I expected this to happen, but could not tell in which industries adoption would run fastest. It seems to be fluid dynamics, video-editors and photo-editors.

AMD and Intel have mostly been selected as collaboration partners. Apple has been a main driver, especially with the introduction of their new Mac Pro with two high-end AMD FirePro GPUs.

Sony Catalyst Family

Sony released three new software packages to support video professionals in pre- and post-production.

Sony-catalyst

This new family of products, Catalyst Browse (media management), Catalyst Prepare (video preproduction assistant) and Catalyst Edit (4K and Sony RAW video editing) has OpenCL support from the start.

Colorfront Express Dailies and On-Set Dailies

This software is an on-set dailies processing system (playback and sync, QC, colour grading, audio and metadata management).

The 2014 versions have OpenCL support in their transcoder plugin, Transkoder.

CGE05_a_OnsetDailies

RED REDCINE-X PRO

redcine-x

REDCINE-X is a coloring toolset, integrated timeline, and post effects collection in a professional, flexible environment for your 4K or 5K .R3D files. RED has added support for OpenCL in build 22.

The Foundry Nuke Blink framework

As presented on GPUconf, The Foundry has opened their framework for running OpenCL kernels. It creates OpenCL-kernels (optimised for AMD or NVIDIA) from C++ Blink kernels.

nukestudio

NUKE studio is a node-based VFX, editorial and finishing studio. As with most products on this page, look for the “reel” to get a nice demo of its capabilities.

Magix Hybrid Video Engine

Video Deluxe and Movie Edit have both had OpenCL support since 2012, thanks to the new shared video engine.

Adobe CS6 creative suite

Adobe has publicly entered the OpenCL market with CS6. Premiere Pro (video editing) and Photoshop (photo-editing) are its two main products with advanced GPU-acceleration via OpenCL.

Video on GPU-effects on Premiere Pro CS5.

FAQ on GPU-acceleration on Photoshop CS6.

Sony Vegas Pro

Vegas Pro is a video editing software package for non-linear editing systems, and has had OpenCL support since version 10d. The consumer version (Sony Movie Studio) also has OpenCL-support.

Sony-Vegas-Pro-13

RealFlow Hybrido2 engine

RealFlow is fluid dynamics software, and its new engine Hybrido2 has had support for OpenCL since this year. And you just have to love their commercial videos.

Autodesk Maya

Maya is a toolset to help create and maintain the modern, open pipelines you need to address today’s challenging 3D animation, visual effects, game development and post-production projects. Since the 2013 version its physics simulations are accelerated via Bullet and OpenCL.

ArcSoft SimHD and Sim3D engine

ArcSoft’s media-engines SimHD and Sim3D have had OpenCL support for several years and are used in several of their products.

simHD

BlackMagic Design

BMD has two suites which use OpenCL, Resolve and Fusion. DaVinci was acquired in 2009 and EyeOn in 2014.

(DaVinci) Resolve

Resolve has real-time colour correction thanks to OpenCL.

(Eyeon) Fusion

EyeonFusionScreenshotSmall

Fusion is an image compositing software program created by eyeon Software Inc. It is typically used to create visual effects and digital compositing for film, HD and commercials.

It has used OpenCL since version 6.

Roxio Creator Suite

Roxio uses OpenCL for accelerated rendering in their suite. They were one of the first to implement OpenCL – I think already in 2010, before OpenCL was even cool.

boxshot-creator

Unluckily they don’t have much information – just a mention that they have support.

Apple Final Cut Pro and iMovie

Apple has support in Final Cut Pro X, Motion 5 and Compressor 4.

finalcutprox_magnetic

iMovie also works a lot faster when you have an OpenCL-capable Mac.

Blender Cycles & Bullet

You cannot find any demonstration of new video hardware without Big Buck Bunny, the short CG movie created with Blender.

It uses OpenCL in two parts: the physics simulations (Bullet) and rendering (Cycles).

Side Effects Houdini

Houdini is a procedural, node-based 3D animation and visual effects toolset for film, broadcast, entertainment and visualisation production.

Digital Film Tools

zMatte_4bDigitalFilmTools

There is support for OpenCL in zMatte, Composite Suite Pro and Film Stocks since Q4 2013.

zMatte is a keyer for blue and green screen composites. Composite Suite Pro is a collection of visual effects plug-ins. Film Stocks simulates color and black and white still photographic film stocks, motion picture films stocks and historical photographic processes.

OTOY OctaneRender 3

OctaneRender is a GPU-based, real-time 3D, unbiased rendering application. In March 2015 OTOY announced OctaneRender 3, which has full OpenCL support:

OpenCL support: OctaneRender 3 will support the broadest range of processors possible using OpenCL to run on Intel CPUs with support for out-of-core geometry, OpenCL FPGAs and ASICs, and AMD GPUs.

Below is a reel of OctaneRender 2 with CUDA. According to OTOY the performance on AMD and NVIDIA is comparable.

SAM Alchemist XF

SAM-alchemist-XF

Alchemist XF supports format and framerate conversion from SD up to 4K for a wide variety of file formats at high speed.

More?

There is a lot more OpenCL-powered software coming up rapidly (we hear things). But we have also missed (or accidentally forgotten) software. Please help make this list complete and send us an email.

OpenCL SPIR by example

SPIR2

OpenCL SPIR (Standard Portable Intermediate Representation) is an intermediate representation for OpenCL-code, comparable to LLVM IR and HSAIL. It is a search for a good representation, such that parallel software runs well on all kinds of accelerators. LLVM IR is too general, but SPIR is a subset of it. I’ll discuss HSAIL and where it differs from SPIR later – I thought SPIR was a better way to start introducing these. In my next article I’d like to give you an overview of the whole ecosphere around OpenCL (including SPIR and HSAIL), to give you an understanding of what it all means, where we’re going, and why.

Know that the new SPIR-V is something completely different in implementation, and we are only discussing the old SPIR here.

Contributors to the SPIR specifications are: Intel, AMD, Altera, ARM, Apple, Broadcom, Codeplay, Nvidia, Qualcomm and Xilinx. Boaz Ouriel of Intel is the pen-holder of the specifications and, unsurprisingly, Intel had the first SPIR-compiler. I am happy to see Nvidia is in the committee too, and hope they don’t just take ideas for CUDA from this collaboration but finally join. Broadcom and Xilinx are new, so we can expect things from them.


For now, just see what SPIR is – as it can help us understand how the compiler works and how to write better OpenCL code. I used Intel’s offline OpenCL compiler; compiling the below kernel to SPIR can be done on the command line with: ioc64 -cmd=build -input=sum.cl -llvm-spir32=sum.ll (you need an Intel CPU to use the compiler).


__kernel void sum(const int size, __global float * vec1, __global float * vec2){
  int ii = get_global_id(0);

  if(ii < size) vec2[ii] += vec1[ii];

}


There are two variations for generating SPIR-code: binary SPIR and LLVM-SPIR (both in 32- and 64-bit versions). As you might understand, the binary form is not really readable, but SPIR described in the LLVM IR language luckily is. Run ioc64 without parameters to see more options (Assembly, pure LLVM, Intermediate Binary).
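To actually consume SPIR, the host loads the binary form via clCreateProgramWithBinary and tells the compiler it receives SPIR via the -x spir build option (this requires a driver with the cl_khr_spir extension). Below is a minimal sketch of that flow in C; the file name is illustrative and error handling is stripped for brevity.

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* Load a binary SPIR module from disk and build it for one device. */
cl_program load_spir(cl_context context, cl_device_id device, const char *path)
{
    /* slurp the file, e.g. a hypothetical "sum.bc" */
    FILE *f = fopen(path, "rb");
    fseek(f, 0, SEEK_END);
    size_t size = (size_t)ftell(f);
    rewind(f);
    unsigned char *spir = (unsigned char *)malloc(size);
    fread(spir, 1, size, f);
    fclose(f);

    /* SPIR is passed in as a "binary"... */
    cl_int err;
    cl_program program = clCreateProgramWithBinary(context, 1, &device,
        &size, (const unsigned char **)&spir, NULL, &err);

    /* ...and -x spir tells the front-end it is SPIR, not a device blob. */
    err = clBuildProgram(program, 1, &device, "-x spir -spir-std=1.2",
                         NULL, NULL);
    free(spir);
    return program;
}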

SC14 Workshop Proposals due 7 February 2014

Just to let you know that there should be even more OpenCL and related technologies at SC14.

sc14emailheader

Are you interested in hosting a workshop at SC14?

Please mark your calendars as SC will be accepting proposals from 1 January – 7 February for independently planned full-, half-, or multi-day workshops.

Workshops complement the overall SC technical program. The goal is to expand the knowledge base of practitioners and researchers in a particular subject area, providing a focused, in-depth venue for presentations, discussion and interaction. Workshop proposals will be peer-reviewed academically, with a focus on submissions that will inspire deep and interactive dialogue on topics of interest to the HPC community.

For more information, please consult: http://sc14.supercomputing.org/program/workshops

Important Submission Info

Web Submissions Open: 1 January 2014
Submission Deadline: 7 February 2014
Notification of acceptance: 28 March 2014

Submit proposals via: https://submissions.supercomputing.org/
Questions: workshops@info.supercomputing.org

We’re thinking of proposing one too. Want to collaborate? Mail us! Don’t forget to go to HiPEAC (20 January) and IWOCL (12 May) to meet us!

FortranCL working example

f90
The ’96 book is still available here, and has some good explanations of numerical mathematics. Oh, the good old times..

Last week I needed to get Fortran working with OpenCL. As the example-page is not up-to-date and not much documentation is on the interwebs outside the official page, this was not as straightforward as I hoped. The test-suite and this article provided code I could actually use. First I wanted to have things in a module, second I needed to control which device I wanted to use, and third I needed function-names that could be used in a larger project. The result is below, and hopefully usable for the Fortran folks around who want to add some OpenCL-kernels to their existing code.

It uses the two-step initialisation we know from C, for safe memory allocation. It is based on the utils.f90 from the test-suite.

The only good way to translate is the Rose-compiler – which is a pain to install. I tried various f2c-scripts (from the ’90s), but they all failed. I must say that continuously switching between Fortran-mode and C-mode was the hardest part of the porting.

If you have tips&tricks to use OpenCL from Fortran, let everybody know in the comments. Also let me know if the code doesn’t work for you, or you have improvements (like better error-handling).

The rest of utils.f90 (which I renamed to clutils.f90 for better integration) is mostly the same – only this subroutine needed changes:

(...)

subroutine cl_initialize(platform_id, device_id, device, context, command_queue)
!use ISO_C_BINDING
type(cl_device_id),     intent(out)     :: device
type(cl_context),       intent(out)     :: context
type(cl_command_queue), intent(out)     :: command_queue
integer                                 :: platform_id
integer                                 :: device_id

integer :: platform_count, device_count, ierr
character(len = 100) :: info
type(cl_platform_id) :: platform
type(cl_platform_id), allocatable, target :: platform_ids(:)
type(cl_device_id), allocatable, target :: device_ids(:)

! get the platform ID
call clGetPlatformIDs(platform_count, ierr)
if(ierr /= CL_SUCCESS) call error_exit('Cannot get CL platform.')
allocate(platform_ids(platform_count))
call clGetPlatformIDs(platform_ids, platform_count, ierr)
if(ierr /= CL_SUCCESS) call error_exit('Cannot get CL platform.')

if (platform_id .gt. platform_count .or. platform_id .lt. 1) platform_id = 1
platform = platform_ids(platform_id)

! get the device ID
call clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, device_count, ierr)
if(ierr /= CL_SUCCESS) call error_exit('Cannot get CL device.')
allocate(device_ids(device_count))
call clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, device_ids, device_count, ierr)
if(ierr /= CL_SUCCESS) call error_exit('Cannot get CL device.')

if (device_id .gt. device_count .or. device_id .lt. 1) device_id = 1
device = device_ids(device_id)

! get the device name and print it
call clGetDeviceInfo(device, CL_DEVICE_NAME, info, ierr)
print*, "CL device: ", info

! create the context and the command queue
context = clCreateContext(platform, device, ierr)
command_queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, ierr)

end subroutine cl_initialize

(...)


The OpenCL event of the year: IWOCL 2014 – Bristol, UK, 12 & 13 May

iwocl

Khronos has supported and organised the International Workshop on OpenCL (IWOCL, pronounced as “eye-wok-ul”) for the second time. Last year the event took place at Georgia Tech, Atlanta, Georgia, in the United States. This year the event will be held in Europe: Bristol University, Bristol, England, UK.

IWOCL 2013 Presentations

Last year there was a varied programme:

  • Porting a Commercial Application to OpenCL: A Case Study
  • Demonstrating Performance Portability of a Custom OpenCL Data Mining Application to the Intel Xeon Phi Coprocessor
  • Parallelization of the Shortest Path Graph Kernel on the GPU
  • OpenCL-based Approach to Heterogeneous Parallel TSP Optimization
  • clMAGMA: High Performance Dense Linear Algebra with OpenCL
  • Multi-Architecture ISA-Level Simulation of OpenCL
  • Optimizing OpenCL Applications on the Intel Xeon Phi

You can see and download these presentations here. This year the organisation aims to offer an equally exciting programme.

Workshop means it’s an active event

It’s all about sharing, but not just by letting you sit and listen. Below you’ll find some of the options.

Present your work

Did you use OpenCL in your software or research? You are very welcome to present your experience and results. IWOCL is the premier forum for the presentation and discussion of new designs, trends, algorithms, programming models, software, tools and ideas for OpenCL.

Abstract Submission Deadline: Friday 31 January, 2014

It can be in the form of:

  • Research paper
  • Technical presentation
  • Workshop or tutorial
  • Poster

(StreamHPC’s Vincent Hindriksen is on the Conference Sessions Committee)

Communicate with the workgroup

20-P1020816
Khronos booth at SC13 – some you will see again at IWOCL

The OpenCL workgroup likes to communicate with OpenCL’s users. IWOCL provides a formal channel for community feedback to the Khronos Group’s OpenCL workgroup. This is one of the best moments to be heard, discuss a hack/bug or share a great idea that should be in the next version of OpenCL.

Meet OpenCL developers and enthusiasts

During the breaks, social events and presentations, you can discuss all your ideas and thoughts on on-topic and off-topic subjects, or join existing conversations.

If you are new to compute acceleration, you’ll find many people who are willing to explain what it does and add their personal view.

Test-drive software

We will bring some hardware, on which you can test your kernels. (We’ll put more info about this later!)

Sponsor and present your product

There will be booths available for the sponsors, where you can show your product to the public.

Stay up to date on the event

We will try to keep you up-to-date as much as possible, and IWOCL has its own channels to keep you informed.

We’ll put up a link when tickets are ready to be sold.

Let others know you plan to be at the event by saying hi in the comments.

Hope to see you there!

The Exascale rat-race vs getting-things-done with HPC

slide-12-638
IDC Forecasts 7 Percent Annual Growth for Global HPC Market – HPCwire

When the new supercomputer “Cartesius” of the Netherlands was presented to the public a few months ago, the buzz was not around FLOPS, but around users. SARA CEO Dr. Ir. Anwar Osseyran kept focusing on this aspect. The design of the machine was not pushed by getting into the TOP500, but by improving the performance of the current users’ software. This was applauded by various HPC experts, including StreamHPC. We need to get things done, not to win a virtual race of some number.

In the description about the supercomputer, the top500-position was only mentioned at the bottom of the page:

Cartesius entered the Top500 in November 2013 at position 184. This Top500 entry only involves the thin nodes resulting in a Linpack performance (Rmax) of 222.7 Tflop/s. Please note that Cartesius is designed to be a well balanced system instead of being a Top500 killer. This is to ensure maximum usability for the Dutch scientific community.

What would happen if you go for a TOP500 supercomputer? You might get a high energy bill and an overpriced, inefficient supercomputer. In the first months you will not have full usage of the machine, and you won’t be able to easily turn off some parts, hence the waste of electricity. This finally results in a situation where it is better to run unoptimised code on the cluster than to take time for coding.

The inefficiency is due to the fact that some software is data-transfer limited and other software is compute-limited. No need to explain that if you go for a Top 500 position and not for a software-optimised design, you end up buying extra hardware to get all kinds of algorithms performing. Cartesius therefore has “fat nodes” and “light nodes” to get the best bang per buck.

There is also a plan for expanding the machine over the years (on-demand growth), such that the users will remain happy instead of having an adrenaline-shot at once.

The rat-race

The HPC Top 500 is run by the company behind the ISC-events. They care about their list being used, not whether there is Exascale now or later. There are two companies with a particular interest in Exascale: Intel and IBM. It hardly matters anymore how it began. What is interesting is that Intel has bought InfiniBand technology and is collecting companies that could make them the one-stop shop for an HPC-cluster. IBM has always been strong in supercomputers with their BlueGene HPC-line. Intel has a very nice infographic on Intel+Exascale, which shows how serious they are.

But then the big question comes: did all this pushing speed up the road to Exascale? Well, no… just the normal peaks and lows around the theoretical logarithmic line:

Top500-exponential-growth
source: CNET

What I find interesting in this graph is that the #500 line is diverging from the #1 line. With GPGPU it was quite easy to enter the Top 500 three years ago.

Did the profits rise? Yes. While PC-sales went down, HPC-revenues grew:

Revenues in the high-performance computing (HPC) server space jumped 7.7 percent last year to $11.1 billion surpassing the $10.3 billion in revenues generated in 2011, according to numbers released by IDC March 21. This came despite a 6.8 percent drop in shipments, an indication of the growing average selling prices in the space, the analysts said. (eWeek.)

So, mainly the willingness to buy HPC has increased. And you cannot stay behind when the rest of the world is focusing on Exascale, can you?

Read more

Keep your feet on the ground and focus on what matters: papers and answers to hard questions.

Did you solve a compute problem and get published using a sub-top-250 supercomputer? Share it in the comments!

Basic concepts: malloc in the kernel

22489954_ml
Pointers and allocated memory space with a hint to Oktoberfest.

During the last training I got the question how to do a malloc in the kernel. It was one of those good questions, as it gives another view on a basic concept of OpenCL. Simply put: you cannot allocate (local or global) memory from within the kernel. Luckily allocating is still possible, but it is somewhat hidden in another function.

clSetKernelArg to the rescue

The way to do it is from the host, using one of the kernel arguments.

cl_int clSetKernelArg(cl_kernel kernel,
                      cl_uint arg_index,
                      size_t arg_size,
                      const void *arg_value)

This function allocates the memory on the device for you. Just as with normal malloc, it doesn’t clear the memory for you.

To make sure the host cannot access it (and you don’t accidentally pin/write/read it, when using host-generation scripts), you can use a flag for that: CL_MEM_HOST_NO_ACCESS. All the flags have been explained in a previous article about this same function, setting flags for creating kernel arguments.

The advantage of only allowing malloc to be done from the host, before the kernel is launched, is that the memory-planning can be done more efficiently.

Local memories

When you need a local space, you can specify that at the kernel-side. For example:

__kernel void foo(__local int* bar) { ... }

This mallocs an area in all local memories with size specified by arg_size.
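To make both “mallocs” concrete, here is a minimal host-side sketch. It assumes a variant of the foo kernel above with an extra global argument; the sizes and names are made up for illustration, and error handling is omitted.

#include <CL/cl.h>

/* Sketch: allocate device memory from the host for the hypothetical
   kernel  __kernel void foo(__global float *data, __local int *bar). */
void setup_foo_args(cl_context context, cl_kernel kernel)
{
    cl_int err;

    /* "Global malloc": a buffer of 1024 floats. CL_MEM_HOST_NO_ACCESS
       (OpenCL 1.2) tells the runtime the host will never read, write
       or map it. Like malloc, the contents start uninitialised. */
    cl_mem data = clCreateBuffer(context,
                                 CL_MEM_READ_WRITE | CL_MEM_HOST_NO_ACCESS,
                                 1024 * sizeof(cl_float), NULL, &err);
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &data);

    /* "Local malloc": arg_value must be NULL; arg_size is the number
       of bytes reserved in each work-group's local memory. */
    err = clSetKernelArg(kernel, 1, 256 * sizeof(cl_int), NULL);
}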

Basic Concepts

This short article is in the basic concept series. It contains several subjects I did not see well-enough explained in books or the reference manual. If you see a subject that you would like to see in this series, just contact us.

Partner up with StreamHPC for Horizon 2020!

For those working in a research department at a company or university within the EU, the name Horizon 2020 might sound familiar.

horizon2020

For us this is an important programme and a source of possibilities for collaboration in the coming years. Our expertise in enabling ultra-fast computations, combined with your expertise, can make Europe more competitive. We are interested in applied GPGPU, in the commercialisation of tools, and in co-developing new software with SMEs and universities based in the EU, Switzerland or Israel.

Fields Europe wants to focus on:

  • Micro- and nano-electronics; photonics
  • Nanotechnologies
  • Advanced materials
  • Biotechnology
  • Advanced manufacturing and processing

The development of these technologies requires a multi-disciplinary knowledge and a capital-intensive approach.

Each of these industries has opportunities for using GPUs and accelerators.

Contact us today for more information. We’ll have a lot to share!

Events are organised throughout Europe to inform universities and companies about the programme. Are you a Dutch university or company? Check this site of the Dutch government.

StreamHPC launches monthly trainings in Europe

OpenCL_Logo

Every second Monday of the month StreamHPC offers an OpenCL training in Mathematics or Media-operations. The target is OpenCL 1.2 (or 1.1 when NVIDIA is discussed). OpenCL 2.0 trainings will start in Q2/Q3, or earlier on request. All trainings will be given by experienced OpenCL developers/trainers.

Trainings

Trainings take 3½ days, from the basics on the first morning to special requests on the 4th morning, either on your own laptop or logged in to a compute-server.

The Media-operations module is based on Heterogeneous Computing with OpenCL, second edition by Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry & Dana Schaa.

It covers convolution, video-processing, histogram and mixed particle simulation. Extra subjects are OpenCL-OpenGL interop and code-optimisation.

A good fit if you work with images, sound and video.


The Mathematics module is based on OpenCL in Action by Matthew Scarpino.

It covers reduction, sorting, matrix-operations and signal processing.

If you work on graphs, matrices and data-manipulation, this is for you!



OpenCL at SC13

Unluckily I am not at SC13, so I’ll enjoy it from a distance. Luckily I won’t miss one of the most beautiful 15km runs in the Netherlands. When there is more news, I’ll add it to this post – below is mostly taken from the Khronos website and the SC13 website.

OpenCL Booth

Meet members of the OpenCL workgroup in Booth #4137 to get hot news from the OpenCL experts and an OpenCL reference card. Also learn about next year’s plans for IWOCL (International Workshop on OpenCL). Be sure not to miss the BOF on OpenCL 2.0 (in bold in the schedule).

Schedule

Sunday, 17 November

8:30 – 17:00 | Tutorial | Structured Parallel Programming with Patterns | Michael McCool, James Reinders, Arch Robison, Michael Hebenstreit | Room 302
8:30 – 17:00 | Tutorial | OpenACC: Productive, Portable Performance on Hybrid Systems Using High-Level Compilers and Tools | Luiz DeRose, Alistair Hart, Heidi Poxon, James Beyer | Room 401

Monday, 18 November

8:30 – 17:00 | Tutorial | OpenCL: A Hands-On Introduction | Tim Mattson, Alice Koniges, Simon McIntosh-Smith | Room 403

Tuesday, 19 November

11:00 & 15:00 | Short presentation | Introduction to OpenCL on FPGAs | AcceleWare | Altera booth
17:15 – 19:00 | ACM Student Research Competition Poster Reception | local_malloc: malloc for OpenCL __local memory [Poster] | John Kloosterman | Mile High Pre-Function

Wednesday, 20 November

11:00 & 15:00 | Short presentation | Introduction to OpenCL on FPGAs | AcceleWare | Altera booth
16:00 | Case study | Accelerating Full Waveform Inversion via OpenCL on AMD GPUs | AcceleWare | AMD booth
17:30 – 19:00 | BOF | OpenCL: Version 2.0 and Beyond | Tim Mattson, Ben Bergen, Simon McIntosh-Smith | Room 405/406/407

Thursday, 21 November

10:30 – 11:00 | Papers | General Transformations for GPU Execution of Tree Traversals | Michael Goldfarb, Youngjoon Jo, Milind Kulkarni | Room 205/207
11:00 – 11:30 | Papers | A Large-Scale Cross-Architecture Evaluation of Thread-Coarsening | Alberto Magni, Christophe Dubach, Michael F.P. O’Boyle | Room 205/207
11:30 – 12:00 | Exhibitor Forum | OpenCL 2.0: Unlocking the Power of Your Heterogeneous Platform | Tim Mattson | Room 501/502
11:30 – 12:00 | Papers | Semi-Automatic Restructuring of Offloadable Tasks for Many-Core Accelerators | Nishkam Ravi, Yi Yang, Tao Bao, Srimat Chakradhar | Room 205/207
11:30 – 12:00 | Short presentation | Introduction to OpenCL on FPGAs | AcceleWare | Altera booth
16:00 – 16:30 | Paper | Accelerating Sparse Matrix-Vector Multiplication on GPUs using Bit-Representation-Optimized Schemes | Wai Teng Tang, Wen Jun Tan, Rajarshi Ray, Yi Wen Wong, Weiguang Chen, Shyh-hao Kuo, Rick Siow Mong Goh, Stephen John Turner, Weng-Fai Wong | Room 401/402/403

There are other interesting SC13-events around OpenCL, so be sure to check the schedule carefully.

image001

Khronos Members Exhibiting at SC13

Complete floor plan is available here.

The links below have been updated from the company homepages to special SC13 landing-pages.

  • Altera Corporation – Booth 4332.
  • AMD – Booth 1113. See this list for a schedule.
  • ARM – Booth 3141.
    • OpenCL on MALI demos at their booth
    • AccelerEyes (#310) showcases ArrayFire (with OpenCL-on-ARM backend) running on Mali T604.
    • Many more – just look for it.
  • Khronos OpenCL – Booth 4137.
  • IBM – Booth 126, 2713. OpenCL on IBM PowerLinux 7R2 and IBM Flex System. Also collaboration with Altera.
  • Intel – Booth 2501, 2701. OpenCL on their CPUs and GPUs.
  • NEC Corporation – Booth 3109. Vector supercomputer – Khronos probably hints at OpenCL running on this machine – see for yourself.
  • NVIDIA – Booth 613. Have fun hearing them say: “Do you use CUDA or are you locked-in to OpenCL?” and variations on this.
  • Texas Instruments – Booth 3725. OpenCL on DSP demo.
  • Xilinx – To schedule a private appointment (for an OpenCL-demo), visit Xilinx at the Convey booth (#3547) or the Alpha Data booth (#4237).

The floorplan can be downloaded here or here (mirrored on 15-Nov).

At SC13 and saw great OpenCL demos or news?

Share this info and photos in the comments, for others to pick up.

CUDA 6 Unified Memory explained

unified-mem
A) Unified Memory Access (UMA). B) NVIDIA’s Unified Virtual Addressing (UVA), now rebranded as “Unified Memory”.

AMD, ARM-vendors and Intel have been busy unifying CPU and GPU memories for years. It is not easy to design a model where 2 (or more) processors can access memory without dead-locking each other.

NVIDIA just announced CUDA 6, and to my surprise it includes “Unified Memory”. Am I missing something completely, or did they just pass their competitors, as the name implies one memory? The answer is in their definition:

Unified Memory — Simplifies programming by enabling applications to access CPU and GPU memory without the need to manually copy data from one to the other, and makes it easier to add support for GPU acceleration in a wide range of programming languages.

The official definition is:

Unified Memory Access (UMA) is a shared memory architecture used in parallel computers. All the processors in the UMA model share the physical memory uniformly. In a UMA architecture, access time to a memory location is independent of which processor makes the request or which memory chip contains the transferred data.

HPCGuru_not-shared-mem

See the difference?

The image at the right explains it differently. A) is how UMA is officially defined, and B) is how NVIDIA has redefined it.

So NVIDIA’s Unified Memory solution is engineered by marketeers, not by hardware engineers. On Twitter, I seem not to be the only one who had the need to explain that it is different from the terminology the other hardware-designers have been using.

So if it is not unified memory, what is it?

It is intelligent synchronisation between CPU and GPU-memory. The real question is what the difference is between Unified Virtual Addressing (UVA, introduced in CUDA 4) and this new thing.

UVA-1024x407

UVA defines a single address space, where CUDA takes care of the synchronisation when the addresses are physically not in the same memory space. The developer has to give ownership to either the CPU or the GPU, so CUDA knows when to sync memories. It does need cudaDeviceSynchronize() to trigger synchronisation (see image).

CudeDeviceSynchronize

From AnandTech, which wrote about Unified (virtual) Memory:

This in turn is intended to make CUDA programming more accessible to wider audiences that may not have been interested in doing their own memory management, or even just freeing up existing CUDA developers from having to do it in the future, speeding up code development.

So it’s there to attract new developers, and then later take care of them being bad programmers? I cannot agree, even if it makes GPU-programming popular – I don’t bike on highways.

From Phoronix, which discussed the changes of NVIDIA Linux driver 331.17:

The new NVIDIA Unified Kernel Memory module is a new kernel module for a Unified Memory feature to be exposed by an upcoming release of NVIDIA’s CUDA. The new module is nvidia-uvm.ko and will allow for a unified memory space between the GPU and system RAM.

So it is UVM 2.0, but without any API-changes. That’s clear then. It simply matters a lot whether the memory is truly or only virtually unified, and I really don’t understand why NVIDIA chose to obfuscate these matters.

In OpenCL this has to be done explicitly, by mapping and unmapping pinned memory, but it is very comparable to what UVM does. I do think UVM is a cleaner API.
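For reference, here is a minimal sketch of that explicit OpenCL pattern: pinned (host-accessible) memory that is mapped into the host address space, worked on, and unmapped again. The names context, queue and n are illustrative, and error handling is omitted.

#include <CL/cl.h>

/* Sketch: share data with the host via explicit map/unmap. */
void fill_on_host(cl_context context, cl_command_queue queue, size_t n)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR,
                                n * sizeof(cl_float), NULL, &err);

    /* Map: the host gets a pointer and temporary "ownership". */
    cl_float *ptr = (cl_float *)clEnqueueMapBuffer(queue, buf, CL_TRUE,
        CL_MAP_WRITE, 0, n * sizeof(cl_float), 0, NULL, NULL, &err);

    for (size_t i = 0; i < n; ++i)
        ptr[i] = 0.0f;             /* work on the data host-side */

    /* Unmap: ownership goes back to the device; the runtime syncs. */
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
}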

Let me know what you think. If you have additional information, I’m happy to add this.

AMD updates the FirePro S10000 to 12GB and passive cooling

203061_FirePro_S10000Passive_Angle

Let the competition on large memory GPUs begin!

Some algorithms and continuous batch processes will enjoy the extra memory. For example, when inverting a large matrix or doing huge simulations, you need as much memory as possible. It also helps to avoid memory-bank conflicts by duplicating data-objects (possible only when the data is in memory long enough to pay back the time it costs to duplicate it).

Another reason for larger memories is double precision computations (this card has a total of 1.48 TFLOPS), which double the memory-requirements. With accelerators becoming a better fit for HPC (true support for the IEEE-754 double precision storage format, ECC-memory), memory-size becomes one of the limits that needs to be solved.

The other choices are swapping on GPUs or using multi-core CPUs. Swapping is not an option, as it nullifies all the speed-up. A server with 4 x 16-core CPUs is as expensive as one accelerator, but uses more energy.

AMD seems to have identified this as an important HPC-market and therefore just announced the new S10000 with 12GB of memory, to be mailed to AMD-partners in January and on the market in April. Is AMD finally taking the professional HPC market seriously? They now have the first 12GB GPU-accelerator built for servers.

Old vs New

Still a few question-marks, unfortunately


Functionality | FirePro S10000 6GB | FirePro S10000 12GB
GPU-processor count | 2 | 2
Architecture | Graphics Core Next | Graphics Core Next
Memory per GPU-processor | 3 GB GDDR5 ECC | 6 GB GDDR5 ECC
Memory bandwidth per GPU-processor | 240 GB/s | 240 GB/s
Performance (single precision, per GPU-proc.) | 2.95 TFLOPS | 2.95 TFLOPS
Performance (double precision, per GPU-proc.) | 0.74 TFLOPS | 0.74 TFLOPS
Max power usage for whole dual-GPU card | 325 Watt | 325 Watt (?)
Greenness for whole dual-GPU card (SP) | 20.35 GFLOPS/Watt | 18.15 GFLOPS/Watt
Bus interface | PCIe 3.0 x16 | PCIe 3.0 x16
Price for whole dual-GPU card | $3500 | ?
Price per GFLOPS (SP) | $0.60 | ?
Price per GFLOPS (DP) | $2.43 | ?
Cooling | Active (!) | Passive

The biggest differences are the doubling of memory and the passive cooling.

Competitors

Its biggest competitor is the Quadro K6000, which I haven’t discussed at all. That card throws out 5.2 TFLOPS using one GPU, which is able to access all 12GB of memory via a 384-bit bus at 288 GB/s (when all cores are used). It is actively cooled, so it’s not really fit for servers (like the 6GB version of the S10000). The S10000 has a higher total bandwidth, but each GPU-processor can only access its own half of the 12GB at full speed. So the K6000 has the advantage here.

Intel is planning 12GB and 16GB XeonPhis. I’m curious about more benchmarks of the new cards, as the 5110P does not have very good results (benchmark 1, benchmark 2). It compares more to a high-end Xeon CPU than to a GPU. I am more enthusiastic about the OpenCL-performance on their CPUs.

What’s next on this path?

A few questions I asked myself and tried to find answers on.

Extendible memory, like we have for CPUs? Probably not, as GDDR5 is not designed to be upgradable.

Unified memory for multi-GPUs? This would solve the disadvantage of multi-die GPU-cards, as 2, 4 or more GPUs could share the same memory. A reason to watch HSA hUMA‘s progress, which now specifies unified memory access between GPU and CPU.

24GB of memory or more? The graph below gives an idea of the costs of GDDR-memory, so it’s an option. These prices are of course excluding supplementary parts and the R&D-costs of making more memory accessible to the GPU-cores.

GPU-parts pricing table
GPU-parts pricing table – Q3 2011

At least we are now going to get an answer to the question: is the market that needs this amount of memory large enough, and thus worth serving?

Is there more need for a wider memory-bus? Remember that GDDR6 is promised for 2014.

What do you think of a 12GB GPU? Do you think this is the path that distinguishes professional GPUs from desktop-GPUs?

CUDA’s multiple targets, the OpenCL version

I’d like to share two images.

The following image has been shared for quite some time, to show the technical capabilities of CUDA.

cuda or opencl?

I replaced “CUDA source” with “OpenCL source” and worked from there. The result:

fixed_CUDA-multi-target

I know it is not optimised for a certain architecture, but neither is the CUDA-source for the two extra targets.

Coming up soon is an article on where the real stuff is currently happening: the higher-level languages being built on top of OpenCL.

Altera published their OpenCL-on-FPGA optimization guide

Altera-doc

Altera has just released their optimisation guide for OpenCL-on-FPGAs. It does not go into the how-tos of OpenCL, but assumes you have knowledge of the technology. Neither does it provide any information on the basics of Altera’s Stratix V or other FPGAs.

It is the first public optimisation document, so feedback sent directly to Altera is appreciated. Not aware of what OpenCL can do on an FPGA? Watch the video below.

Subjects

The following subjects and optimisation tricks are discussed:

  • FPGA Overview
  • Pipelines
  • Good Design Practices
  • Avoid Pointer Aliasing
  • Avoid Expensive Functions
  • Avoid Work-Item ID-Dependent Backward Branching
  • Aligned Memory Allocation
  • Ensure 4-Byte Alignment for All Data Structures
  • Maintain Similar Structures for Vector Type Elements
  • Optimization of Data Processing Efficiency
  • Specify a Maximum Work-Group Size or a Required Work-Group Size
  • Loop Unrolling
  • Resource Sharing
  • Kernel Vectorization
  • Multiple Compute Units
  • Combination of Compute Unit Replication and Kernel SIMD Vectorization
  • Resource-Driven Optimization
  • Floating-Point Operations
  • Optimization of Memory Access Efficiency
  • General Guidelines on Optimizing Memory Accesses
  • Optimize Global Memory Accesses
  • Perform Kernel Computations Using Constant, Local or Private Memory
  • Single Work-Item Execution

Carefully compare these with CPU and GPU optimisation guides to be able to write more generic OpenCL code.
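To give a flavour of a few of the listed items – a required work-group size, kernel SIMD vectorization, compute-unit replication and loop unrolling – below is a small kernel sketch using the attribute spellings from Altera’s guide. The kernel itself is made up, and exact attribute support depends on your SDK version.

// Sketch: FPGA-oriented attributes from the guide on a toy kernel.
__attribute__((reqd_work_group_size(64, 1, 1)))  // fixed work-group size
__attribute__((num_simd_work_items(4)))          // kernel SIMD vectorization
__attribute__((num_compute_units(2)))            // compute-unit replication
__kernel void scale(__global const float *in, __global float *out)
{
    int i = get_global_id(0);
    float x = in[i];

    #pragma unroll                 // fully unroll: a deeper, wider pipeline
    for (int k = 0; k < 4; k++)
        x = x * 0.5f + 1.0f;

    out[i] = x;
}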

Download

You can download the document here.

If you have any question on OpenCL-on-FPGAs, OpenCL, generic optimisations or Altera FPGAs, feel welcomed to contact us.

ARM forums to find useful information for OpenCL development

OpenCL on ARM is hot, but it is just getting started. Currently it takes some time to find the information you need about the processors concerned.

For OpenCL-discussions the best place is the Khronos OpenCL board. So where can you go when you want to ask questions specifically about GPUs found in ARM-based SoCs, like MALI, PowerVR, Adreno and Vivante?

ARM’s new community site for all

ARM just launched the Connected Community (ARM CC). It is the place to connect to when you need general information about ARM IP, such as ARM MALI, Cortex-A9 and Cortex-A15.

arm-forums

And here is how ARM themselves explains this initiative on one slide:

ARMConnectedCommunityIntro

Be sure to connect to StreamHPC. We hope this will indeed be the central place for the whole ecosystem, including Imagination, Qualcomm and Vivante.

ARM MALI

Mali-developer

The MALI Developer Center has its forums on ARM Connected Community.

Imagination PowerVR

The graphics-section of their developer forums seems to be the best place.

imgtec dev forums

(Not @ ARM CC)

Qualcomm Adreno

Qualcomm has dev-forums too and has a section called Mobile Gaming & Graphics Optimization (Adreno™).

qualcomm-forum-adreno

(Not @ ARM CC)

Vivante

Vivante does not have a forum, but Freescale does. The i.MX forums seem to be the best place to ask your questions.

freescale-forums

@ARM CC

Others

Where do you find good sources of interesting information on mobile GPUs? Share them with others via the comments – the chance that your questions get answered increases when more people visit the forums.

Guest-blog: Accelerating sequential machine vision algorithms with OpenMP and OpenCL

Jaap van de Loosdrecht

Guest-blogger Jaap van de Loosdrecht wants to share his thesis with you. He leads the Centre of Expertise in Computer Vision at NHL University of Applied Sciences, owns his own company, and still managed to study and write an MSc-thesis. The thesis is interesting because it extensively compares OpenCL with OpenMP, especially in chapters 7 and 8.

For those who are interested, my thesis “Acceleration sequential machine vision algorithms using commodity parallel hardware” is available at www.vdlmv.nl/thesis.

Keywords: Computer Vision, Image processing, Parallel programming, Multi-core CPU, GPU, C++, OpenMP, OpenCL.

Many other related research projects have compared the best sequential implementation of one domain-specific algorithm with the best parallel implementation on a specific hardware platform. This work is distinctive because it investigated how to speed up a whole library by parallelizing the algorithms in an economical way and executing them on multiple platforms. This work has:

  • Examined, compared and evaluated 22 programming languages and environments for parallel computing on multi-core CPUs and GPUs.
  • Chosen to use OpenMP as the standard for multi-core CPU programming and OpenCL for GPU programming.
  • Re-implemented a number of standard and well-known algorithms in Computer Vision using both standards.
  • Tested the performance of the implemented parallel algorithms and compared the performance to the sequential implementations of the commercially available software package VisionLab.
  • Evaluated the test results with a view to assessing:
    • Appropriateness of multi-core CPU and GPU architectures in Computer Vision.
    • Benefits and costs of parallel approaches to implementation of Computer Vision algorithms.

Using OpenMP it was demonstrated that many algorithms of a library could be parallelized in an economical way and that adequate speedups were achieved on two multi-core CPU platforms. With a considerable amount of extra effort, OpenCL was used to achieve much higher speedups for specific algorithms on dedicated GPUs.

At the end of the project, the choice of standards was re-evaluated including newly emerged ones. Recommendations are given for using standards in the future, and for future research and development.

Algorithmic improvements are suggested for Convolution and Connect Component Labelling.

Your feedback and/or questions are welcome.

If you put comments here, I’ll make sure Jaap van de Loosdrecht will get to know and answer your questions on the subjects discussed in his thesis.

All the members of the OpenCL working group 2013

In the below list are the members of the OpenCL workgroup as of November 2013.

OCL-WG-members-Nov12

We can expect small changes each year, but this is close to the actual state. I need the rest of Q4 to finalise all the info – any help is appreciated.

This list was also compiled in 2010, and you can see several differences. If a company has an SDK available, there is a link. That is a big difference from the last list – this one is much more concrete.

Reducing downtime with OpenCL… Ever thought of that?

downtime

Something that creates extra value for OpenCL is the flexibility with which it runs on a wide variety of hardware. A famous strategy is running the code on CPUs to find data-races and debug the code more easily. Another is to develop on GPUs and port to FPGAs to reduce the development-cycles.

But there’s one, quite important, often forgotten benefit: replacement of faulty hardware. You can blame the supplier, or even Murphy if you want, but it is almost certain that there’s a high chance of facing downtime precisely when the hardware cannot be replaced right away.

Failing to plan is planning to fail

To limit downtime, there are a few options:

  • Have a good SLA in place for 24/7 hardware-replacement.
  • Have spare-hardware in stock.
  • Have over-capacity on your compute-servers.

But the problem is that all three are expensive in some form if you’re not flexible enough. If you use professional accelerators like the Intel Xeon Phi, NVIDIA Tesla or AMD FirePro, you risk an unexpected stock shortage at your supplier.

With OpenCL the hardware can be replaced by any accelerator, whereas with vendor-specific solutions this is not possible.

Flexibility by OpenCL

I’d like to share with you one example how to introduce flexibility in your hardware-management, but there are various others which are more tailored to your requirements.

To detect faulty hardware, you can think of a server with three GPUs and let selected jobs be run by all three – any hardware-problem will be detected and pin-pointed. Administrating which hardware has done which job completes the mechanism. Exactly this can be used to replace faulty hardware with any accelerator: let the replacement-accelerator run the same jobs as the other two as an acceptance-test.

If you need your software to be optimised for several accelerators, you’re in the right place. We can help you with both machine and hand optimizations. That’s a plan that cannot fail!

Products using OpenCL on ARM MALI are coming

mali-product-feature-CLSDK-940x300_vX1

This past year you might not have heard much about OpenCL-on-ARM, besides the Arndale developer-board. You have heard just a small portion of what has been going on.

Yesterday the (Linux) OpenCL-drivers for the Chromebook (which contains an ARM MALI T604) were released, and several companies will launch products using OpenCL.

Below are a few interviews with companies who have built such products. This will give an idea of what is possible on those low-power devices. To first get an idea of what this MALI T604 GPU can do when it comes to OpenCL, here is a video from the 2013-edition of the LEAP-conference we co-organised.

Understand that the whole board takes less than ~11.6 Watts – that is including the CPU, GPU, memory, interconnects, networking, SD-card, power-adapter, etc. Only a small portion of that is the GPU. I don’t know the exact specs, as this developer-board was not targeted towards energy-optimisation goals. I do know this is less than the 225 Watts of a discrete GPU alone.

Interviews with ARM partners

Basic Concepts: Writing OpenCL code for single and double precision

Precision
What’s precise enough?

Support for the double precision floating-point type double in OpenCL kernels requires an extension. AMD provides cl_khr_fp64 for newer high-end hardware, but also a non-fully-compliant cl_amd_fp64 extension for other hardware. NVIDIA and Intel support cl_khr_fp64, so no exceptions need to be made for those drivers.

The code you see below these lines is based on a page you can find on Bealto, written by Eric Bainville. I added extra typedefs, removed a constant and added DOUBLE_SUPPORT_AVAILABLE for easier fallback.

#if CONFIG_USE_DOUBLE

#if defined(cl_khr_fp64)  // Khronos extension available?
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#define DOUBLE_SUPPORT_AVAILABLE
#elif defined(cl_amd_fp64)  // AMD extension available?
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#define DOUBLE_SUPPORT_AVAILABLE
#endif

#endif // CONFIG_USE_DOUBLE

#if defined(DOUBLE_SUPPORT_AVAILABLE)

// double
typedef double real_t;
typedef double2 real2_t;
typedef double3 real3_t;
typedef double4 real4_t;
typedef double8 real8_t;
typedef double16 real16_t;
#define PI 3.14159265358979323846

#else

// float
typedef float real_t;
typedef float2 real2_t;
typedef float3 real3_t;
typedef float4 real4_t;
typedef float8 real8_t;
typedef float16 real16_t;
#define PI 3.14159265359f

#endif

A macro is defined by the OpenCL C compiler for each available extension, which is cl_khr_fp64 in this example. This macro can be tested to enable the extension with #pragma OPENCL EXTENSION cl_khr_fp64 : enable.

Now you need to use the defined constant(s) and the real_t, real2_t types instead of float or double. The definition of CONFIG_USE_DOUBLE is passed as a compilation option to clBuildProgram to make the switch between double and single precision. If there is no double-support, it falls back to single precision.
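On the host, that switch is a one-liner. A minimal sketch, assuming program and device have already been created:

#include <CL/cl.h>

/* Build the kernel in double or single precision. */
cl_int build_with_precision(cl_program program, cl_device_id device,
                            int use_double)
{
    const char *options = use_double ? "-D CONFIG_USE_DOUBLE=1" : "";
    return clBuildProgram(program, 1, &device, options, NULL, NULL);
}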


Basic Concepts: out of resources with clEnqueueReadBuffer

Oops
“Oops! The best way to learn, when you love trial-on-error”™

In the series “Basic Concepts” various basics of GPGPU and OpenCL are discussed. This time we go into a typical one: when an error does not imply the actual problem. It is therefore good to have an overview of all errors with their descriptions.

When you get an out-of-resources error, or a crash when using clEnqueueReadBuffer, you are sort of left in the dark. What does it mean? And how can you solve it?

Typical: one driver crashes/segfaults and another one gives this error.

Officially the error is defined as:

CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.

This means there can be more reasons than the device actually being out of resources. A better name would have been CL_RESOURCE_ALLOCATION_ERROR. It can be thrown by various functions, but we focus on this one function. It cannot be thrown by clEnqueueWriteBuffer, as that depends on the limits of the host.

Finding out the cause

The oldest trick of ’em all: try to use the CPU and check what the error is then. CPUs are great for detecting data-races (correct on CPU, not on GPU), and CPUs are a bit more stable when you have buggy code, plus they have more RAM. Be sure to install both Intel’s and AMD’s drivers.

Calling clFinish after each line helps you pinpoint the actual line where it happens, or get an error instead of a crash.
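A simple sketch of that debugging pattern, with a hypothetical CHECK macro (it slows everything down, so use it for debugging only):

#include <CL/cl.h>
#include <stdio.h>

/* Print file and line when an OpenCL call fails. */
#define CHECK(call)                                         \
    do {                                                    \
        cl_int _e = (call);                                 \
        if (_e != CL_SUCCESS)                               \
            fprintf(stderr, "%s:%d: CL error %d\n",         \
                    __FILE__, __LINE__, _e);                \
    } while (0)

/* Force each command to complete immediately, so a crash or error
   points at the enqueue that caused it, not at a later call. */
void debug_read(cl_command_queue queue, cl_mem buf,
                size_t size, void *host_ptr)
{
    CHECK(clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size,
                              host_ptr, 0, NULL, NULL));
    CHECK(clFinish(queue));
}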

Then you have the following options:

  1. 9 out of 10 times you have a pointer problem at the host or are writing out of bounds. So you try to write to an illegal memory location, or try to cram a 35×35 float* into a 10x10x10 float* space (buffer-overflow). Double-check the host memory-sizes and whether the host-pointers are correct.
  2. You read out of bounds on the device. Double-check the used memory-sizes.
  3. You might have hit a limit of the driver, such as the 5s timeout when the NVIDIA card is also being used as a display. Rule out that you have used up all memory by trying both smaller and larger(!) objects, and note down memory-object sizes over time. Be sure you clean up unused objects. Fragmentation of device-memory can also be the reason it eventually goes wrong.

The last one I have not encountered myself, but found on the NVIDIA forums. I recently had this error (type 1) in code I was working on; when I introduced the standard ‘h_’ and ‘d_’ prefixes for all variables, I immediately found the cause.

Hope this has helped you understand the resource-allocation error. If you found other reasons, please share them via the comments and I’ll add them. If you have requests for what to discuss in this series, let me know via Twitter or the comments.