Customer: “So you also do full projects?”

Posted by Vincent Hindriksen on 15 March 2017

One of these moments when you find out that the company is not seen as I want it to be seen. Compared to generic software engineering companies, we have the advantage of creating software that is capable of processing more data.

For years we have promoted this unique advantage, for which we use OpenCL, CUDA, HIP and several other languages. Always having full projects in mind, but unfortunately not clearly communicating this.

During discussions with several existing and new customers, it became suddenly clear that we are seen as a company that fixes code, not one that builds the full code.

It became most clear that when was suggested to let us collaborate with another party, where our role would be to make sure they would not make mistakes regarding performance and code-quality.

Customer:   You can work with a team we hired before.

Us:             We also do full projects.

Customer:   Really?

This would mean we would be the seniors in the group, but not own the project – a suboptimal situation, as important design decisions could be ignored. Continue reading “Customer: “So you also do full projects?”” →

Cloud

We accelerated the OpenCL backend of pyPaSWAS sequence aligner

Posted by Vincent Hindriksen on 15 August 2019

Last year we accelerated the OpenCL-code in PaSWAS, which is open source software to do DNA/RNA/protein sequence alignment and trimming. It has users world-wide in universities, research groups and industry.

Below you’ll find the benchmark results of our acceleration work. You can also test out yourself, as the code is public. In the readme-file you can learn more about the idea of the software. Lots of background information is described in these two papers:

We chose PaSWAS because we really like bio-informatics and computational chemistry – the science is interesting, the problems are complex and the potential GPU-speedup is real. Other examples of such software we worked on are GROMACS and TeraChem.

Continue reading →

Do your (X86) CPU and GPU support OpenCL?

Posted by Vincent Hindriksen on 29 December 2011 with 26 Comments

Does your computer have OpenCL-capable hardware? Read on and find out if your computer is compatible…

If you want to know what other non-PC hardware (phones, tablets, FPGAs, DSPs, etc) is running OpenCL, see the OpenCL SDK page.

For people who only want to run OpenCL-software and have recent hardware, just read this paragraph. If you have recent drivers for your GPU, you can be sure OpenCL is already supported and you can run OpenCL-capable software. NVidia has support for OpenCL 1.1 since drivers 280.13, so if you need OpenCL 1.1, then make sure you have this version or later. If you want to use Intel-processors and you don’t have an AMD GPU installed, you need to download the runtime of Intel OpenCL.

If you want to know if your X86 device is supported, you’ll find answers in this article.

Often it is not clear how OpenCL works on CPUs. If you have a 8 core processor with double threading, then it mostly is understood that 16 pipelines of instructions are possible. OpenCL takes care of this threading, but also uses parallelism provided by SSE and AVX extension. I talked more about this here and here. Meaning that an 8-core processor with AVX can compute 8 times 32 bytes (8*8 floats or 8*4 doubles) in parallel. You could see it as parallelism of parallelism. SSE is designed with multimedia-operations in mind, but has enough to be used with OpenCL. The minimum requirement for OpenCL-on-a-CPU is SSE 4.2, though.

A question I see often is what to do if you have more devices. There is no OpenCL-package for all the available devices, so you then need to install drivers for each device. CPU-drivers are often included in the GPU-drivers.

Read on to find out exactly which processors are supported.

Continue reading “Do your (X86) CPU and GPU support OpenCL?” →

AMD’s answer to NVIDIA TESLA K10: the FirePro S9000

Posted by Vincent Hindriksen on 13 September 2012 with 8 Comments

Recently AMD announced their new FirePro GPUs to be used in servers: the S9000 (shown at the right) and the S7000. They use passive cooling, as server-racks are actively cooled already. AMD partners for servers will have products ready Q1 2013 or even before. SuperMicro, Dell and HP will probably be one of the first.

What does this mean? We finally get a very good alternative to TESLA: servers with probably 2 (1U) or 4+ (3U) FirePro GPUs giving 6.46 to up to 12.92 TFLOPS or more theoretical extra performance on top of the available CPU. At StreamHPC we are happy with that, as AMD is a strong OpenCL-supporter and FirePro GPUs give much more performance than TESLAs. It also outperforms the unreleased Intel Xeon Phi in single precision and is close in double precision.

Edit: About the multi-GPU configuration

A multi-GPU card has various advantages as it uses less power and space, but does not compare to a single GPU. As the communication goes via the PCI-bus still, the compute-capabilities between two GPU cards and a multi-GPU card is not that different. Compute-problems are most times memory-bound and that is an important factor that GPUs outperform CPUs, as they have a very high memory bandwidth. Therefore I put a lot of weight on memory and cache available per GPU and core.

Continue reading “AMD’s answer to NVIDIA TESLA K10: the FirePro S9000” →

Download all OpenCL header files and build your own OpenCL library

Posted by Vincent Hindriksen on 3 June 2016

OpenCL header files

When you develop professional software, it is a best practise to have external header files fixed, to have versioning under full control. This way you don’t get the surprises when your colleague has another OpenCL SDK installed. Luckily the Khronos Group has put all version of the OpenCL header files on Github, so you can easily download the targeted OpenCL version.

Download a zip of the header files here:

If you found problems in one of these, you can directly communicate with the working group by submitting an issue on Github.

OpenCL.lib / libOpenCL.so

But wait, there is more!

You can build your own ICD, as the sources are open (licence). OpenCL version 2.1 is implemented, but it is fully backwards compatible to OpenCL 1.0. You can assume that the vendors use this code for their own, so you can safely use this code in your project.

Get the project from Github.

AMD positions FirePro S10000 against both TESLA K10 (and K20)

Posted by Vincent Hindriksen on 15 November 2012

During the “little” HPC-show, SC12, several vendors have launched some very impressive products. Question is who steals the show from whom? Intel got their Phi-processor finally launched, NVIDIA came with the TESLA K20 plus K20X, and AMD introduced the FirePro S10000.

This card is the fastest card out there with 5.91 TFLOPS of processing power – much faster than the TESLA K20X, which only does 3.95 TFLOPS. But comparing a dual-GPU to a single-GPU card is not always fair. The moment you choose to have more than one GPU (several GPUs in one case or a small cluster), the S10000 can be fully compared to the Tesla K20 and K20X.

The S10000 can be seen as a dual-GPU version of the S90000, but does not fully add up. Most obvious is the big difference in power-usage (325 Watt) and the active cooling. As server-cases are made for 225 Watt cooling-power, this is seen as a potential possible disadvantage. But AMD has clearly looked around – for GPUs not 1U-cases are used, but 3U-servers using the full width to stack several GPUs.

Continue reading “AMD positions FirePro S10000 against both TESLA K10 (and K20)” →

How expensive is an operation on a CPU?

Posted by Vincent Hindriksen on 16 July 2012 with 13 Comments

Programmers know the value of everything and the costs of nothing. I saw this quote a while back and loved it immediately. The quote by Alan Perlis is originally about ~~Perl~~ LISP-programmers, but only highly trained HPC-programmers seem to have obtained this basic knowledge well. In an interview with Andrew Richards of Codeplay I heard it from another perspective: software languages were not developed in a time that cache was 100 times faster than memory. He claimed that it should be exposed to the programmer what is expensive and what isn’t. I agreed again and hence this post.

I think it is very clear that programming languages (and/or IDEs) need to be redesigned to overcome the hardware-changes of the past 5 years. I talked about that in the article “Separation of compute, control and transfer” and “Lots of loops“. But it does not seem to be enough.

So what are the costs of each operation (on CPUs)?

This article is just to help you on your way, and most of all: to make you aware. Note it is incomplete and probably not valid for all kinds of CPUs.

Continue reading “How expensive is an operation on a CPU?” →

Professional and Consumer Media Software using OpenCL

Posted by Vincent Hindriksen on 28 December 2013

More and more professional media software now has support for OpenCL. It starts to be a race where you cannot stay behind. If the competitor runs more than twice as fast on the same hardware, then you just can’t say “Sorry, you should buy NVIDIA hardware”. I expected this to happen, but could not tell in what industry they would run fastest. Seems it is fluid dynamics, video-editors and photo-editors.

AMD and Intel mostly have been selected as collaboration partners. Apple has been a main drive, especially with the introduction of their new MAC Pro with two high-end AMD FirePro GPUs.

Sony Catalyst Family

Sony released three new software packages to support video professionals in pre- and post-production.

This new family of products, Catalyst Browse (media management), Catalyst Prepare (video preproduction assistant) and Catalyst Edit (4K and Sony RAW video editing) has OpenCL support from the start.

Colorfront Express Dailies and On-Set Dailies

This software is an on-set dailies processing system (playback and sync, QC, colour grading, audio and metadata management).

The 2014 versions have OpenCL support in their transcoder plugin, Transkoder.

RED REDCINE-X PRO

REDCINE-X is a coloring toolset, integrated timeline, and post effects collection in a professional, flexible environment for your 4K or 5K .R3D files. RED has added support for OpenCL in build 22.

The Foundry Nuke Blink framework

As presented on GPUconf, The Foundry has opened their framework for running OpenCL kernels. It creates OpenCL-kernels (optimised for AMD or NVIDIA) from C++ Blink kernels.

NUKE studio is a node-based VFX, editorial and finishing studio. As with most products on this page, look for the “reel” to get a nice demo of its capabilities.

Magix Hybrid Video Engine

Video Deluxe and Movie Edit both have OpenCL support since 2012, thanks to the new shared video engine.

http://www.youtube.com/watch?v=27M7vJIYR3c

Adobe CS6 creative suite

Adobe has entered the OpenCL market publicly with . With Premiere Pro (video editing) and Photoshop (photo-editing) two main products with advanced GPU-acceleration via OpenCL.

http://www.youtube.com/watch?v=F3LwNT1QUPQ

Video on GPU-effects on Premiere Pro CS5.

FAQ on GPU-acceleration on Photoshop CS6.

Sony Vegas Pro

Vegas Pro is a video editing software package for non-linear editing systems, and has OpenCL support since version 10d. Also in the consumer version (Sony Movie Studio) there is OpenCL-support.

RealFlow Hybrido2 engine

RealFlow is fluid dynamics software and its new engine Hybrido2 has support for OpenCL since this year. And you just have to love their commercial videos.

http://www.youtube.com/watch?v=Fj0err96BbQ

Autodesk Maya

Maya is a toolsets to help create and maintain the modern, open pipelines you need to address today’s challenging 3D animation, visual effects, game development and post-production projects. Since the 2013 version it is accelerated for physics simulations via Bullet and OpenCL.

http://www.youtube.com/watch?v=36bIdH6EBkM

ArcSoft SimHD and Sim3D engine

ArcSoft media-engines SimHD and Sim3D have OpenCL support since several years and are used in several of their products.

http://www.youtube.com/watch?v=tvXyLKEeX2I

BlackMagic Design

BMD has two suites which use OpenCL, Resolve and Fusion. DaVinci was acquired in 2009 and EyeOn in 2014.

(DaVinci) Resolve

Resolve has real-time colour correction thanks to OpenCL.

http://www.youtube.com/watch?v=lfrudtCTwv0

(Eyeon) Fusion

Fusion is an image compositing software program created by eyeon Software Inc. It is typically used to create visual effects and digital compositing for film, HD and commercials.

It uses OpenCL since version 6.

Roxio Creator Suite

Roxio uses OpenCL for accelerated rendering in their suite. They were one of the first to implement OpenCL – I think already in 2010, before OpenCL was even cool.

Unluckily they don’t have much information – just a mention that they have support.

Apple Final Cut Pro and iMovie

Apple has support in Final Cut Pro X, Motion 5 and Compressor 4.

Also iMovie works a lot faster when you have an OpenCL capable MAC.

Blender Cycles & Bullet

You cannot find any demonstration of new video hardware without Big Bucks Bunny, the short CG movie created with Blender.

It uses OpenCL in two parts: physics simulations (Bullet) and compositor (Cycles).

http://www.youtube.com/watch?v=QbzE8jOO7_0

Side Effects Houdini

Houdini is a procedural node based 3D animation and visual effects tools for film, broadcast, entertainment and visualisation production.

http://vimeo.com/46444204

DigitalFilmTools

There is support for OpenCL in zMatte, Composite Suite Pro and Film Stocks since Q4 2013.

zMatte is a keyer for blue and green screen composites. Composite Suite Pro is a collection of visual effects plug-ins. Film Stocks simulates color and black and white still photographic film stocks, motion picture films stocks and historical photographic processes.

OTOY OctaneRender 3

OctaneRender is a GPU-based, real-time 3D, unbiased rendering application. In March 2015 OTOY announced OctaneRender 3, which has full OpenCL support:

OpenCL support: OctaneRender 3 will support the broadest range of processors possible using OpenCL to run on Intel CPUs with support for out-of-core geometry, OpenCL FPGAs and ASICs, and AMD GPUs.

Below is a reel of OcateRender 2 with CUDA. According to OTOY the performance on AMD and NVidia is comparable.

https://www.youtube.com/watch?v=gLSBVt0VQSI

SAM Alchemist XF

Alchemist XF supports format and framerate conversion from SD up to 4K for a wide variety of file formats at high speed.

More?

There is a lot more OpenCL-powered software coming up rapidly (we hear things). But we also missed (or accidentally forgot) software. Please help making this list complete and send us an email.

5 types of loops you should avoid

Posted by Vincent Hindriksen on 8 April 2012 with 4 Comments

In “Separation of compute, control and transfer” I talked about node-wise programming as a method we should embrace instead of trying to unroll the existing loops. In this article I get into loops and discuss a few types and how they can be run in a parallel form. Dependency is the big variable in each type: the lower the dependency on previous iterations, the better it can be parallelised. Another one is the known iteration-dimensions known before the loop is started.

The more you think about it, the more you find that a loop is not a loop.

Continue reading “5 types of loops you should avoid” →

AMD OpenCL Programming Guide August 2013 is out!

Posted by Vincent Hindriksen on 6 August 2013

AMD has just released an update to their AMD programming guide.

~~Download the guide (PDF) August version~~

Download the guide (PDF) November version

Download TOC (PDF)

For more optimisation guides, see the tutorials page of the knowledge base.

Chapter 1 OpenCL Architecture and AMD Accelerated Parallel Processing

1.1 Software Overview
1.1.1 Synchronization

1.2 Hardware Overview for Southern Islands Devices

1.3 Hardware Overview for Evergreen and Northern Islands Devices

1.4 The AMD Accelerated Parallel Processing Implementation of OpenCL Continue reading “AMD OpenCL Programming Guide August 2013 is out!” →

OpenCL at SC13

Posted by Vincent Hindriksen on 15 November 2013

Unluckily I am not at SC13, so I’ll enjoy from a distance. Luckily I don’t miss one of the most beautiful 15km runs in the Netherlands. When there is more news, I’ll add to this post – below is mostly taken from the Khronos website and the SC2013 website.

OpenCL Booth

Meet members of the OpenCL workgroup in Booth #4137 to get hot news from the OpenCL experts and an OpenCL reference card. Also learn about next year’s plans for IWOCL (International Workshop on OpenCL). Be sure not to miss the BOF on OpenCL 2.0 (in bold in the schedule).

Schedule

Sunday, 17 November
8:30 – 17:00	Tutorials	Structured Parallel Programming with Patterns	Michael McCool, James Reinders, Arch Robison, Michael Hebenstreit	302
8:30 – 17:00	Tutorials	OpenACC: Productive, Portable Performance on Hybrid Systems Using High-Level Compilers and Tools	Luiz DeRose, Alistair Hart, Heidi Poxon, James Beyer	401
Monday, 18 November
8:30 – 17:00	Tutorials	OpenCL: A Hands-On Introduction	Tim Mattson, Alice Koniges, Simon McIntosh-Smith	403
Tuesday, 19 November
11:00 & 15:00	ACM Student Research Competition Poster Reception	Introduction to OpenCL on FPGAs	AcceleWare	Altera booth
17:15 – 19:00	Short presentation	local_malloc: malloc for OpenCL __local memory [Poster]	John Kloosterman	Mile High Pre-Function
Wednesday, 20 November
11:00 & 15:00	Short presentation	Introduction to OpenCL on FPGAs	AcceleWare	Altera booth
16:00	Case Study	Accelerating Full Waveform Inversion via OpenCL on AMD GPUs	AcceleWare	AMD booth
17:30 – 19:00	BOF	OpenCL: Version 2.0 and Beyond	Tim Mattson, Ben Bergen, Simon McIntosh-Smith	405/406/407
Thursday, 21 November
11:30 – 12:00	Exhibitor Forum	OpenCL 2.0: Unlocking the Power of Your Heterogeneous Platform	Tim Mattson	501/502
10:30 – 11:00	Papers	General Transformations for GPU Execution of Tree Traversals	Michael Goldfarb, Youngjoon Jo, Milind Kulkarni	205/207
11:00 – 11:30	Papers	A Large-Scale Cross-Architecture Evaluation of Thread-Coarsening	Alberto Magni, Christophe Dubach, Michael F.P. O’Boyle	205/207
11:30 – 12:00	Papers	Semi-Automatic Restructuring of Offloadable Tasks for Many-Core Accelerators	Nishkam Ravi, Yi Yang, Tao Bao, Srimat Chakradhar	205/207
11:30 – 12:00	Short presentation	Introduction to OpenCL on FPGAs	AcceleWare	Altera booth
16:00 – 16:30	Paper	Accelerating Sparse Matrix-Vector Multiplication on GPUs using Bit-Representation-Optimized Schemes	Wai Teng Tang, Wen Jun Tan, Rajarshi Ray, Yi Wen Wong, Weiguang Chen, Shyh-hao Kuo, Rick Siow Mong Goh, Stephen John Turner, Weng-Fai Wong	401/402/403

There are other interesting SC13-events around OpenCL, so be sure to check the schedule carefully.

Khronos Members Exhibiting at SC13

Complete floor plan is available here.

Below links are updated from the company-homepages to special SC13 landing-pages.

Altera Corporation – Booth 4332.
- BittWare will be showcasing their TeraBox High Performance Reconfigurable Computing Platform, which runs OpenCL.
- Acceleware. See schedule above.
AMD – Booth 1113. See this list for a schedule.
ARM – Booth 3141.
- OpenCL on MALI demos at their booth
- AccelerEyes (#310) showcases ArrayFire (with OpenCL-on-ARM backend) running on Mali T604.
- Many more – just look for it.
Khronos OpenCL – Booth 4137.
IBM – Booth 126, 2713. OpenCL on IBM PowerLinux 7R2 and IBM Flex System. Also collaboration with Altera.
Intel – Booth 2501, 2701. OpenCL on their CPUs and GPUs.
NEC Corporation – 3109. Vector supercomputer – Khronos probably hints to OpenCL running on this machine – see for yourself.
NVIDIA – Booth 613. Have fun hearing them say: “Do you use CUDA or are you locked-in to OpenCL?” and variations on this.
Texas Instruments – Booth 3725. OpenCL on DSP demo.
XilinX – To schedule a private appointment (for an OpenCL-demo) visit Xilinx at the Convey booth (#3547) or the Alpha Data booth (#4237).

The floorplan can be downloaded here or here (mirrored on 15-Nov).

At SC13 and saw great OpenCL demos or news?

Share this info and photos in the comments, for others to pick up.

All the members of the OpenCL working group 2010

Posted by Vincent Hindriksen on 11 April 2010

(If you’re searching for companies who offer OpenCL-products and services, please visit OpenCL:Pro)

You probably have heard AMD is on the OpenCL working group of Khronos; but there are many more and they possibly all have plans to use it. Here is an overview, so you can make your own conclusions about the future that lays ahead. Is your company on “the list”?

We’re specially interested in the less known companies, so most information is about the companies you and us possibly have not heard from before. We’ve made assumptions what the companies use OpenCL for, so we need your feedback if you think we’re wrong! Most of these companies have not openly written about their (future) accelerated products, so we had to make those guesses.

Disclaimer: All brand and product names are or may be trademarks of, and are used to identify products or services of, their respective owners.

Last updated 6-Oct-2010.

GPU Manufacturers

GPUs being the first products targeted by OpenCL, we blast away with a list of CPU-manufacturers. You might see some unknown companies and now know which companies missed the train; it is pretty clear why GPU-manufacturers have interest in OpenCL.
We skip the companies who have a GPU-stack built upon ARM-techology and only focus on pure GPU-manufacturers in this category.

AMD

We’ve already discussed the biggest fan of OpenCL several times. While having better GPU-cards than NVIDIA (arguable per quarter of the year), they put their bets completely on OpenCL. They even get credits like “AMD’s OpenCL” when compared with NVIDIA’s CUDA.

The end of 2010, beginning of 2011 they will ship their Fusion-product having a CPU and GPU on one chip. The first Fusion-chips will not have a high-end GPU because of heating problems, is told to PC-store employees.

NVIDIA

AMD’s biggest competitor with the very well marketed similar product CUDA. Currently they have the most specialised products in market for servers. While they put more energy in their own technology CUDA, it must be said that they have adopted OpenCL more than any other hardware vendor.

Intel

The biggest part of the CPU-market is for Intel en guess once, who has the biggest GPU-market in hands? Correct: onboard-GPUs are Intel’s speciality, but their high-end GPU Larrabee might once see the market. Just like AMD they have the technology (and products) to have an integrated CPU/GPU which will be very interesting for the upcoming OpenCL-market.

They are openly interested in OpenCL. Here is a nice interview which explains how a CPU-designer looks at GPU-designs.

Vivante

Vivante manufactures GPU-chips. They claim their OpenGL ES 2.0-compliant silicon footprint is the smallest on the market. There is a lot of talk about OpenGL Shader Language (OpenCL’s grandpa), for which their products are very well suited for. Quote: “The recent trend in graphics hardware has been to replace fixed functionality with programmability in areas that have grown exceedingly complex, such as vertex processing and fragment processing. The OpenGL® Shading Language was designed to allow application programmers to express the processing that occurs at those programmable points of the OpenGL pipeline. Independently compilable units written in this language are called shaders. A program is a set of shaders that are compiled and linked together.”

Takumi

Japanese corporation Takumi manufactures the GSHARK, a 2D/3D hardware accelerator. The focus is on shaders, like Vivante.

Imagination Technologies (ImTech)

From their homepage: >>POWERVR enables a powerful and flexible solution for all forms of multimedia processing, including 3D/2D/vector graphics and general purpose processing (GP-GPU) including image processing.

POWERVR’s unique tile-based, deferred rendering/shading architecture allows a very small area of a die to deliver higher performance and image quality at lower power consumption than all competing technologies. All major APIs are supported including OpenGL ES 2.0/1.1, OpenVG 1.1, OpenGL 2.0/3.0 and DirectX9/10.1 and OpenCL.<<

Currently all ARM-based OpenCL-capable devices have POWERVR-technology.

Toshiba

Like other huge Japanese everything-factories, you don’t know what else they make. Besides rice cookers they also make multimedia chips.

S3

Once they were big in the consumer-market of graphics cards, but S3 still exists as a more business-oriented manufacturer of graphics products.

CPU Manufacturers

We miss the Power Architecture, but IBM and Freescale are members of this group.

Intel

While AMD tries to make OpenCL available for the CPU, we have not heard of a similar product from Intel yet. They see a future for multi-core CPUs, as seen in these slides.

ARM

Most known for its same-named low-power processor, not supported by MS Windows. You can read below how many companies have a license on their technology. Together with POWERVR-technology they power all the embedded OpenCL devices of the coming year.

IBM

Currently they are most known for their Cell-processor (co-developed with Toshiba and Sony) and have a license to build PowerArchitecture-CPUs. The Cell has full OpenCL-support as first non-GPU. Older types of PS3s (without the latest firmware) ad IBM’s servers can use the power of OpenCL. End of June 2010 Khronos conformed their “Development Kit for Linux” for Power VMX and PowerXCell8i processors.

Freescale

Once a Motorola-division, they make lots of different CPUs. Besides ARM- and PowerArchitecure-based ones, they also have it’s own ‘Coldfire’. We cannot say for which architecture they are interested in OpenCL, but we really would like to hear something from them since they can open many markets for OpenCL.

Systems on a Chip (SoC)

While it is cool to have a GPU-card in your pc, more and more the Graphics-functionality is integrated onto a CPU. Especially in the mobile/embedded/gadget-market you’ll find such System-on-a-Chip solutions, which are actually all ARM- or PowerArchitecture based.

3DLABS (ZiiLabs)

Creators of embedded hardware with focus on handhelds. They have partners of Khronos for a long time, having built the first merchant OpenGL GPU, the GLINT 300SX. They have just released a multimedia-processor, which is an ARM-processor with pretty interesting graphic capabilities.

They have an “early access program for OpenCL” for their ZMS product line.

Movidia

On their Technology overview-page they imply they have flexible accelerators in their designs, which *could* in the future be controlled by OpenCL-kernels. They manufacture mobile GPUs-plus-loads-of-extras which are quite impressive.

Texas Instruments

Besides ARM-based processors they also have DSPs. We watch them, for which product they have OpenCL in mind.

Qualcomm

They might be most famous for their ARM-based Snapdragon-chipset. They have much more products, but we think they start with Snapdragon before building OpenCL in other products.

Apple

The Apple A4 powers their new products, the iPad. It becomes more and more clear Apple has really learned that you cannot rely on one supplier, after waiting for IBM’s G6. With OpenCL Apple can now make software that works on ARM, all kind of GPUs and CPUs.

Samsung

They make anything that is fed by batteries, so for that reason they should be in the “other” category: mobile phones, mp3-players, photo-cameras, camcorders, laptops, TVs, DVD-players and Bluray-players. All products where OpenCL can wield.

A good reason to make their own semi-conductors, ARM-based.

In the beginning of June 2010 they have launched their own Linux-based OS for mobiles: Bada.

Broadcom

Manufactures networking and communications ICs for data, voice, and video applications. They could use OpenCL for their mobile multimedia processors.

Seaweed

Since September acquired by Presagis. We cannot be sure they continue the OpenCL-business of Seaweed, but at least GPGPU is mentioned once.

Presagis is “the worldwide leader in embedded graphics solutions for mission-critical display applications. The company has provided human-machine interface (HMI) graphical modeling tools, drivers and devices for embedded systems for over 20 years. Presagis pioneered both the prototyping of display graphics and automatic code generation for embedded systems in the 1990s. Since then, code generated by its flagship HMI modeling products has been deployed to hundreds of aircraft worldwide and its software has been certified on over 30 major aircraft programs worldwide. Presagis is your trusted partner for reliable, high-performance embedded graphics products and services.”

ST Microelectronics

ST has many products: “Singapore Technologies Electronics is a leader in ICT. It has main businesses in Enterprise, Satellite Communications and Interactive Digital Media. It is divided into several Strategic Business Units consisting of Info-Comms, Info-Software, Training and Simulation, Electro-Optics, Large Scale Group, Satcom & Sensor Systems.”

We think they’ve shown interest for OpenCL for use with their Imaging processors. Together with Ericsson they have a joint-venture in de mobile market, ST-Ericsson.

Handheld Manufacturers

While most companies will find it hard to make OpenCL-business in the consumer-market, consumer-products of other companies make sales a little bit warmer.

Apple

At least the iPad and iPhone have hardware-capabilities of running OpenCL. It is expected that it will come available in the next major release of the iPhone-OS, iOS 4. We’re waiting for more news.

Nokia

The largest manufacturer of mobile phones from Finland has a lot of technology. Besides smartphones, possibly a netbook (in cooperation with Intel) they also have Symbian and the QT-library. Since a while QT has support for OpenCL. We think the support of OpenCL in programming languages (in a more high-level way) is very important. See these slides to read some insights of the company.

Motorola

They have consumer products like mobile phones and business products like networking. It is not clear where they are going to use OpenCL for, since they mostly use other companies’ technologies.

Super-computers

While OpenCL can revive old computers once upgraded with a new GPU, imagine what they can do with Super-computers.

IBM

IBM builds super-computers based on different technologies. With OpenCL-support for their Power VMX and PowerXCell8i processors, it is already possible to use OpenCL with IBM-hardware.

Fujitsu

They have many products, but they also make super-computers which use GPGPU.

Los Alamos National Laboratory

They build super-computers and really can use the extra power.

A job-post talks about heterogeneous architectures and OpenCL.

Petapath

Petapath, founded in 2008, focuses on delivering innovative hardware and software solutions into the high performance computing (HPC) and embedded markets. As can be seen from their homepage they build grids.

NVIDIA

As a newcomer in the super-computer business, they do very well having helped to build the #2 HPC. Many clusters are upgraded with their streaming-processors.

Other Hardware

We don’t know what they are actually doing with the technology, purely because they are to big to make assumptions.

GE

US-based electronics-giant General Electronics builds everything there is, fed by electricity and now also GPGPU-powered solutions as can be found on their GPGPU-page. They probably switched to CUDA.

ST-Ericsson

Ericsson together with ST they have a joint-venture in de mobile market, ST-Ericsson. Ericssson is big in (mobile) networking. It also builds mobile phones with Sony. It is unclear what the joint-venture wants to do with the technology, but it must be mobile.

Software Developers

While OpenCL is very close to hardware, we have to talk software too. Did anybody say there is a strict line between hardware and software?

Graphic Remedy

Builders of debugging software. You will hear later more from us about this company soon. See something about debugging in this presentation.

RapidMind

RapidMind provided a software product that aims to make it simpler for software developers to target multi-core processors and accelerators (GPUs). It was acquired by Intel in august 2009.

HI

Japanese corporation HI has a product MascotCapsule, which is a real-time 3D rendering engine (native library) that runs on embedded devices. We see names of other companies, except SMedia. If you’re not familiar with mobile GPUs, here you have a list.

This is another big hint, OpenCL will have a big future on mobile devices.

MascotCapsule V4 product specification

Operating environment	CPU	ARM: ARM9 or above Freescale: i.MX Series Marvell: XScale Qualcomm: MSM6280/6550/7200/7500 etc. Renesas Technology: SH-Mobile etc. Texas Instruments: OMAP 32-bit 150 MHz or above is recommended (Capable of running without a floating-point hardware)
	Code size	Approx. 200 KB
	Engine work area	2 MB or more is recommended, including data load area Note: The actual required work area varies depending on the content
3D hardware accelerator		ATI: Imageon Imagination Technologies: PowerVR MBX/MBX Lite/SGX NVIDIA: GoForce SMedia: Glamo TAKUMI: GSHARK Toshiba: T4G/T5G Other OpenGL ES compliant 3D accelerators
OS/platforms		BREW, iPhone, iPod touch, ITRON, Java, Linux, Symbian OS, Windows CE, Windows Mobile
3D authoring tools		3ds Max 9.0/2008/2009/2010 Maya 8.5/2008/2009/2010 LightWave3D 7.5 or later SOFTIMAGE\|XSI 5.x/6.x/7.0

Codeplay

They are most famous for their compilers for the Playstation. They also make code-analysis software.

QNX

From their homepage: “Middleware, development tools, realtime operating systemsoftware and services for superior embedded design”. Their real-time OS in all kinds of embedded products and they might want to see ways to support specialised low-power chips.

RIM acquired QNX in april 2010.

Fixstars

Newcomer in the list 2010. Famous for their PS3-Linux and for their OpenCL-book. They also have FOXC, Fixstars OpenCL Cross Compiler. They have written one of the few books for OpenCL.

Kestrel Institute

http://www.kestrel.edu/ does not show anything GPGPU. We’ll probably hear from them when the next version of their Specware-product is finished.

Game Designers

Physics-calculations and AI are too demanding to do on a CPU. The game-industry keeps pushing the GPU-industry, but now on a different way than in the 90’s.

Electronic Arts

This game-studio builds loads and loads of games with impressive AI. See these slides to see what EA thinks GPGPU can do.

Activision Blizzard

Yes, they are one company now, so now they are together famous for best-selling hit “World of Warcraft”. Currently not much is known where they use OpenCL for, but probably the same as EA.

Thank you for your interest in this article

If you know more about OpenCL at these companies or job-posts, please let us know via comment or via e-mail.

We’ve made some assumptions about what these companies use OpenCL for – we need your feedback!

NVIDIA’s answer to FirePro S9000: the TESLA K20

Posted by Vincent Hindriksen on 12 November 2012 with 5 Comments

Two months ago I wrote about the FirePro S9000 – AMD’s answer to the K10 – and was already looking forward to this K20. Where in the gaming world, it hardly matters what card you buy to get good gaming performance, in the compute-world it does. AMD presented 3.230 TFLOPS with 6GB of memory, and now we are going to see what the K20 can do.

The K20 is very different from its predecessor, the K10. Biggest difference is the difference between the number of double precision capability and this card being more powerful using a single GPU. To keep power-usage low, it is clocked at only 705MHz compared to 1GHz of the K10. I could not find information of the memory-bandwidth.

ECC is turned on by default, hence you see it presented having 5GB. No information yet if this card also has ECC on the local memories/caches.

Continue reading “NVIDIA’s answer to FirePro S9000: the TESLA K20” →

IWOCL 2019

Posted by Vincent Hindriksen on 6 April 2019

On Monday May 13, 2019 at 09:30 the latest edition of IWOCL starts, not taking into account any pre-events that might be spontaneously organized. This is the biggest OpenCL-focused event that discusses everything that would make any GPGPU-programmer, DSP-programmer and FPGA-programmer enthusiastic.

What’s new since last year, is that it’s actually also more interesting place for CUDA-developers who like to learn and discuss new GPU-programming techniques. This is because Nvidia’s GTC has moved more to AI, where it used to be mostly GPGPU for years.

Since it’s now the last week of the early-bird pricing, it’s a good time to make you think about buying your ticket and book the trip.

Continue reading →

Molybdenite and graphene to the helping hand?

Posted by Vincent Hindriksen on 26 April 2011

The Last Nimzy — The rabbit in “The Last Mimzy” was very special. What material was it made of?

You might have read about Molybdenite a few months ago. It is more efficient than Graphene which is in turn more efficient than good old Silicon, most notable energy-wise. Magazine ‘Nature’ had an article on it, which is summarised by Psychorg, so check it out. The claim it is 100 000 times more efficient than Silicon (and more efficient than the already very promising Graphene). This fan-free Silicon-replacer would be a major disaster for the cooling-industry!

But what would change for us? We are now on the edge to move to ARM (started by the smartphone- and tablet-industry), but is al this needed if the energy-costs drop to prices comparable to the costs to keep ice-cream cold on the North-Pole (20 years ago). This technique would give huge potential to Fusion-chips which now have a long way to go, to solve the heat-problem. But since it would take several years (and thus decades in hi-tech years) to get these chips on the market, no assumptions for market-share can be made based on what will happen in a few years.

Low-power ARM and Molybdenite X86

So this is European ARM (and licensees around the world) vs US Intel and AMD. The sarcastic joke among me and a few friends make, is that the fight of the past 20, 30 years between the economic US and EU is actually about who has the money to hire the most Asians, to develop the revolutionising devices. But as long as the US and EU have the feeling we are actually the equation of the competition as we are a massive 12% of the world-population, I won’t be behind the facts too much.

Since batteries don’t evolve as fast as processors, the power-problem needed to get slashed differently. A mayor reason for choosing ARM is that it uses less energy than X86, just like LCD/TFT is replaced by e-ink and organic LEDs and memory is non-volatile in portable devices.

In case we get a big reduction for CPU and memory, then the efficiency of the architecture is less of a problem. So then Intel and AMD can re-enter the market again, but then with much more powerful devices. Until then ARM-licensees like NVIDIA and ImTec have a better market if it comes to near-future devices. As I expected more tablet-manufacturers come up with docking-stations to replace the PC with a tablet. AMD and Intel have to keep surprising (and probably protect their market) the coming years to avoid losing from ARM. In other words: the coming years will be exciting how the consumer-market looks like and which companies deal in it. When thinking about these years, keep in mind what Windows XP has thought us: computers are fast enough for what average Joe wants to do with it. Hey, I use my laptop for OpenCL and the big screen, for the rest I use my mobile phone.

Hybrid chips

While I did not see it as a serious problem last year, the heat-problem for a GPU+CPU on one chip is quite a challenge. Waiting for the Molybdrenite or Graphene chips to mature will be like digging your own grave. Each step forward will result in two new products: one which is more power and/or heat efficient, and one which is more powerful. Since the competition from ARM-companies is heavy, the chances that the focus will be on more powerful Hybrid CPUs is bigger. As I stated above the losses are in the low-power area. Intel and AMD are very aware of this challenge.

Have you checked the differences between DirectX 10 and 11 games? Just check the discussions on the growing side of not needing to support DirectX 11, because 10 is good enough. Also here, the demand is higher to have the same graphics-quality for less money on more portable devices. Hybrid CPUs will eat the GPU-market for sure.

ARM-processors are hybrid processors. That’s all I tell, so you can -in combination with all stated above- formulate your own conclusions. I was very surprised NVIDIA started targeting ARM with their high-end GPUs, but was this a real bad idea?

Device vs Data-centre

Reduction of energy-costs for processors will reduce the head-less servers in the data-centre enormously. Internet costs loads of energy, both the transport and the servers – this will reduce the server-part of energy-consumption-sum with quite some factors. All positive news.

But if it all this becomes true, that chips don’t use much energy anymore and actually mobile internet and other radios take the most, what will happen to the cloud? Will you upload your video to get it processed or put your mobile in the sun to charge it while waiting a shorter period?

Current developments, future needs

We need arithmetic, media-processing and input/output; we all have that. We need long battery-life, a good screen and a fast way to input our data and commands; we get more of that each day. But heat-production is Silicon limits a lot, so we get the perfect electronic device the moment we can replace Silicon. Getting rid of the heat could give us square chips, with challenges like reinventing the socket and multi-multi-layerness.

So the question to you: is in The Last Nimzy sequel (you know, the movie with the molybdenite rabbit) a logo of Intel, AMD, ARM or another company found?

Imagination

Imagination is best known for their GPUs in Apples iDevices. They have support for:

OpenCL
Apple Metal
Vulkan
OpenGL
Google RenderScript

Imagination is a strong supporter of Khronos APIs OpenCL and Vulkan.

OpenCL

Currently there are two PowerVR GPU architectures with OpenCL support: the 5 series (scroll down) and the 6 series (introduced in 2014).

PowerVR 6

In 2013 companies will launch processors using IP from Imagination Technologies, the PowerVR G6230 and G6430. Named licensees are:

ST-Ericcson: NovaThor A9600 – not available yet or even mentioned on their own webpage.
Texas Instruments: no products anounced, latest OMAP5-series are PowerVR 5 based.
Renesas Electronics: no products announced, latest are based on Series5 (SGX54x and 53x).
MediaTek: no products anounced, latest MT6577 is based on Series5.
HiSilicon: licensed, but no products announced.

While a lot of news was around this platform, it has been delayed several times. Their latest designs in the series are running on an FPGA and more details will be given at CES 2013 (source).

Performance

Below you see where the PowerVR6 stands. It is clocked much higher (from 250MHz to 600MHz), which suggests it will be baked sub-sub 45 nm. The PowerVR 5 is used in for example the iPad2 and delivers around 70GFlops. The PowerVR6 G62x0 is promised to deliver 200GFlops and up. The TFLOPS barrier is promised to be broken with the series.

3_cpu_vs_gpu_GFLOPS_bars — Comparison between PowerVR 5 and 6 series.

http://withimagination.imgtec.com/powervr/powervr-series6xe-gpus-bring-opengl-es-3-0-graphics-everyone

http://blog.imgtec.com/news/accelerate-design-closure-for-ip-cores-from-imagination-dok-design-flows

The below image shows the 6-series are baked on 32nm and below. It shows different series-identifiers though. The “3” in G6x30 is the addition of “frame buffer compression logic” (source).

PowerVR 5

The chipset that currently dominates mobile devices, from the PSP to tablets to phones to Apple iPad.

Drivers

Imagination only sells IP and refers to their licensees for driver-support. If you have ideas what to do with OpenCL -on-PowerVR, you can request for an NDA here.

Texas Instruments has the drivers available, but only under an SLA. Contact your TI-representative for the most recent information.

Samsung delivers drivers with their Exynos 5 Octa development board (Odroid XU).

Boards

Let me know if you know a TI-board and can give me a description of the business-requirements to get hands on drivers.

Exynos 5410: ODROID-XU

ARM Cortex-A15 Quad 1.6GHz + Cortex-A7 Quad 1.2GHz
PowerVR 5 SGX544 MP3 GPU
2GB LPDDR3 (12.8GB/s memory bandwidth)
Lots of IO-ports (see image below). No wifi without dongle.
CCI-400 bug seems to be fixed (source). Not clear how.

OpenCL info from the FAQ:

Which OpenGL and OpenCL are included in Android?
OpenGL ES 1.1 and OpenGL ES 2.0
OpenCL 1.1 Embedded Profile

Will It run Ubuntu or other Linux distros?
Currently we supports only Ubuntu 13.04 server version with only serial console.
We need to develop HDMI/LCD driver for Xorg display.
We are trying to release Linux BSP with OpenGL/OpenCL in Q4 of 2013.

Buy here. Forum here.

Drivers

See this page for explanation how to write the images. The links don’t get updated, so check the above links for the latest versions.

Minimum price for board+eMMC+shipping is $273,-. Price including some needed (eMMC, shipping), possibly needed (HDMI, USB-UART) and convenience add-on’s (SD, wifi) is:

Board includes adaptor and case. An eMMC is needed for a fast OS, but a micro-SD also works – although slower.

Without the integrated power analysis tool, it’s $30,- less. Ordering only the board is $10 less shipping.

Devices

Once I get more (public) info on OpenCL-drivers, this section will be extended.

Kindle Fire HD

As seen on Engadget, Imagination Technologies is working with Amazon and Texas Instruments to deliver OpenCL-enabled Kindle Fire HDs.

http://www.youtube.com/watch?v=-twOwM4LP9o

The chipset is a OMAP 4470 by Texas Instruments, which contains a PowerVR SGX544 GPU running at 384MHz. It only delivers 24.5GFLOPS.

Imagination Technologies PowerVR

[infobox type=”information”]

Need a PowerVR programmer? Hire us!

[/infobox]

Currently there are two PowerVR GPU architectures with OpenCL support: the 5 series (scroll down) and the 6 series (introduced in 2014).

PowerVR 6

In 2013 companies will launch processors using IP from Imagination Technologies, the PowerVR G6230 and G6430. Named licensees are:

ST-Ericcson: NovaThor A9600 – not available yet or even mentioned on their own webpage.
Texas Instruments: no products anounced, latest OMAP5-series are PowerVR 5 based.
Renesas Electronics: no products announced, latest are based on Series5 (SGX54x and 53x).
MediaTek: no products anounced, latest MT6577 is based on Series5.
HiSilicon: licensed, but no products announced.

While a lot of news was around this platform, it has been delayed several times. Their latest designs in the series are running on an FPGA and more details will be given at CES 2013 (source).

Performance

http://withimagination.imgtec.com/powervr/powervr-series6xe-gpus-bring-opengl-es-3-0-graphics-everyone

http://blog.imgtec.com/news/accelerate-design-closure-for-ip-cores-from-imagination-dok-design-flows

The below image shows the 6-series are baked on 32nm and below. It shows different series-identifiers though. The “3” in G6x30 is the addition of “frame buffer compression logic” (source).

PowerVR 5

The chipset that currently dominates mobile devices, from the PSP to tablets to phones to Apple iPad.

Drivers

Imagination only sells IP and refers to their licensees for driver-support. If you have ideas what to do with OpenCL -on-PowerVR, you can request for an NDA here.

Texas Instruments has the drivers available, but only under an SLA. Contact your TI-representative for the most recent information.

Samsung delivers drivers with their Exynos 5 Octa development board (Odroid XU).

Boards

Let me know if you know a TI-board and can give me a description of the business-requirements to get hands on drivers.

Exynos 5410: ODROID-XU

ARM Cortex-A15 Quad 1.6GHz + Cortex-A7 Quad 1.2GHz
PowerVR 5 SGX544 MP3 GPU
2GB LPDDR3 (12.8GB/s memory bandwidth)
Lots of IO-ports (see image below). No wifi without dongle.
CCI-400 bug seems to be fixed (source). Not clear how.

OpenCL info from the FAQ:

Which OpenGL and OpenCL are included in Android?
OpenGL ES 1.1 and OpenGL ES 2.0
OpenCL 1.1 Embedded Profile

Buy here. Forum here.

Drivers

See this page for explanation how to write the images. The links don’t get updated, so check the above links for the latest versions.

Minimum price for board+eMMC+shipping is $273,-. Price including some needed (eMMC, shipping), possibly needed (HDMI, USB-UART) and convenience add-on’s (SD, wifi) is:

Board includes adaptor and case. An eMMC is needed for a fast OS, but a micro-SD also works – although slower.

Without the integrated power analysis tool, it’s $30,- less. Ordering only the board is $10 less shipping.

Devices

Once I get more (public) info on OpenCL-drivers, this section will be extended.

Kindle Fire HD

As seen on Engadget, Imagination Technologies is working with Amazon and Texas Instruments to deliver OpenCL-enabled Kindle Fire HDs.

http://www.youtube.com/watch?v=-twOwM4LP9o

The chipset is a OMAP 4470 by Texas Instruments, which contains a PowerVR SGX544 GPU running at 384MHz. It only delivers 24.5GFLOPS.

ST-Ericsson NovaThor LP9600 (Nova A9600)

This chipset has a dual-core ARM Cortex-A15 2,3 GHz, 28 nm, PowerVR series6 and “d-channel LP-DDR2”. It should become available in Q1 2013.

Since Ericcson will leave ST-Ericcson after a transition period, it is unclear if the chipset is delayed.

Engineering GPGPU into existing software

Posted by Vincent Hindriksen on 5 November 2010 with 1 Comment

At the Thalesian talk about OpenCL I gave in London it was quite hard to find a way to talk about OpenCL for a very diverse public (without falling back to listing code-samples for 50 minutes); some knew just everything about HPC and other only heard of CUDA and/or OpenCL. One of the subjects I chose to talk about was how to integrate OpenCL (or GPGPU in common) into existing software. The reason is that we all have built nice, cute little programs which were super-fast, but it’s another story when it must be integrated in some enterprise-level software.

Readiness

The most important step is making your software ready. Software engineering can be very hectic; managing this in a nice matter (i.e. PRINCE2) just doesn’t fit in a deadline-mined schedule. We all know it costs less time and money when looking at the total picture, but time is just against.

Let’s exaggerate. New ideas, new updates of algorithms, new tactics and methods arrive on the wrong moment, Murphy-wise. It has to be done yesterday, so testing is only allowed when the code will be in the production-code too. Programmers just have to understand the cost of delay, but luckily is coming to the rescue and says: “It is my responsibility”. And after a year of stress your software is the best in the company and gets labelled as “platform”; meaning that your software is chosen to include all small ideas and scripts your colleagues have come up “which are almost the same as your software does, only little different”. This will turn the platform into something unmanageable. That is a different kind of software-acceptance!

Continue reading “Engineering GPGPU into existing software” →

Assessment of existing code-base’s quality

You found the main computation takes over 90% of the processing time, or you found the framework to be slow in general. We got in contact and discussed speeding up your software, after which we shared this assessment with you. This assessment should give you insights if the software-project is ready to be shared with Stream HPC.

The larger the code-base, the more important its code-quality

The first step is to prepare code for porting or optimization. As it’s not always easy to know what to do, we’ve defined 3 levels of code quality. The higher the quality-level, the fewer obstacles for the project and the lower the costs, the less communication is required and the fewer frustrations.

Preparing a project for porting / optimizing

The sections below discuss the 3 levels where a project can be. The goal is that you do a self-assessment and write down answers for each question with details, not only provide the final answer.

The code needs to have all levels marked in red: full level 1 and the high level of level 2. It does not matter if the existing code is written in Matlab, Python, C, C++, Assembly, OpenCL, CUDA or any other language.

The action points of all levels need to be done. When a project is not ready yet, we can assist in improving code-quality, but assume a lot has to be done by your team. This will be a separate (pre-)project, and the full estimation for the porting/optimization can only be done after that.

If not possible to level up the software or no source files are available, it will be handled as a black box project or R&D project. Do know that such projects can never be done fixed priced and are always unique. The generic part of the process is described in this blog – we are experienced in doing such focused R&D projects.

Level 1: Understandability

Goal: Can the software be understood without help from the main developers?

High level

Are the algorithms explained in i.e. scientific papers?
Do alternate implementations exist? I.e. in Python or Matlab.
Optional: is there a presentation/overview on the algorithm?

Mid level

Is there a software design document?
Is test-data provided that can be used to run the code?

Low level

Are all functions documented?
Is it clear what each (part of the) function is doing?
Are all variables documented?
Is it clear what each variable means?

Action points

A good communication-plan to get questions answered quickly. This includes a direct contact and regular calls.
Walk the code and improve/update the existing in-code documentation. Often it’s not looked at since it was written.

Level 2: Testability

Goal: Are there few bugs and can new bugs be easily detected?

High level

Is there a golden standard of the standard, such like a proof-of-concept or the existing code? Is it available to us?
Are the outputs deterministic? Or can they be made deterministic quickly?
Is there a good understanding where the algorithm is less stable to unstable, and can this be explained?
Is there clarity on the required maximum quantitative errors that would define the correctness of output? Are there high level test-cases for all of these?
Is there clarity on required maximum qualitative errors that would define the correctness of output? Is the collection of examples and counter-examples large enough to define the error in full?
Can the whole library tested using the sample input and output?
Is there clarity on the required precision? When is an answer correct?

Mid Level

Does the CI contain static code analysis?
Are there test-cases or automated tools for finding qualitative errors?

Low level

Are the compute-intensive functions covered by functional tests?
Are the most important functions covered by functional tests?
Are the other functions covered by functional tests?

Action points

Making the code deterministic by temporary removing random number generator.
Define what is correct output in detail.
Complete the test cases for the different types of errors.
Creating several sets of data-in and its correct data-out. This will be used for acceptance of ported code, giving the maximum errors allowed.
Decide which sub-results need to be compared.

Level 3: Quality

Goal: Is the software easy to maintain and extend?

High level

Is there build-automaton, like Cmake or Meson?
Is the code multi-OS?
Are tests run with every commit, or at least daily/nightly?

Mid level

Are functions defined to only do one thing and do it well?
Nobody of the team labelled the code as “Spaghetti code”?
Are function names self-explanatory?
Are variable-names self-explanatory?

Low level

There is no duplicated code?
There is no code that actually can be made much less complex?
There are no functions or methods longer than 100 lines excluding documentation?
There are no large classes?
There are no functions or methods with far more than 7 parameters?
There is only few commented out code?
There is only few dead code? This is code that is not being called from anywhere.
There is no code that should be rewritten?
There is no variables that are being reused?
There is no significant dependence on global state?

Action points

Prepare the (new) GPU-code for continuous integration.
Document how the ported code maps to the existing code.

Costs Assessment

Code at level 0 can take 10 – 20 times more time than code at level 1. How much exactly is difficult to say, and that’s exactly the reason we require a minimal level. The bitter pill is that the costs of not cleaning up the code is often even higher due to increased hardware costs, increased maintenance costs and lack of innovation.

There is only one reason to keep code quality under level 1: when it’s going to be replaced within a month.

When level 1 is mostly done, each missing item can take 20% to 40% of extra costs. A good example is not having good test-cases or no CPU-code available or not even an executable. This adds costs in both the implementation phase and the acceptance phase. Making a new CPU-implementation first is often cheaper.

From the described minimal level (level 1 in full, level 2 the high level) costs are more predictable and less costly. When getting a quote, these can be requested to be mentioned and you can choose to do it yourself or let us do it.

Contact

Just discuss your goals, after done an assessment. We can guide you in prioritization of making your code ready for porting.

Email to info@streamhpc.com or look to the other contact options on the contact page here. Then we’ll schedule a call and discuss what you need.

Copyright

If you find this self-assessment useful, know it took us a long time to improve this document and a lot of experience is hidden in it. But as we find quality software very important, we’re releasing this list under CC-BY-NC-ND: it cannot be altered or used commercially, and it must have a clear reference to Stream HPC as the authors. In other words: it must be clear that you did not do the research or writing of this assessment yourself.

If you have feedback or suggestions, we’d really like to hear from you!

Difference between CUDA and OpenCL 2010

Posted by Vincent Hindriksen on 22 April 2010 with 11 Comments

THIS ARTICLE IS VERY OUTDATED AND NOW SIMPLY UNTRUE FOR CERTAIN PARTS! NEW ARTICLE COMING UP.

Most GPGPU-enthusiasts have heard of both OpenCL and CUDA. While there are more solutions, these have the most potential. Both techniques are very comparable like a BMW and a Mercedes, but there are some differences. Since the technologies will evolve, we’ll take a look at the differences again next year. We’ve discussed this difference in a with a focus on marketing earlier this year.

Disclaimer: we have a strong focus on OpenCL (but actually for reasons explained in this article).

Terminology

If you have seen kernels of OpenCL and CUDA, you see the biggest difference might be the prefix “cl_” or the prefix “cu_”, but there is also a difference in terminology.

Matt Harvey (developer of Cuda2OpenCL-translator Swan) has summed up the differences in a presentation “Experiences porting from CUDA to OpenCL” (PDF):

CUDA term	OpenCL term
GPU	Device
Multiprocessor	Compute Unit
Scalar core	Processing element
Global memory	Global memory
Shared (per-block) memory	Local memory
Local memory (automatic, or local)	Private memory
kernel	program
block	work-group
thread	work item

As far as I know, the kernel-program is also called a kernel in OpenCL. Personally I like Cuda’s terms “thread” and “per-block memory” more. It is very clear CUDA targets the GPU only, while in OpenCL it an be any device.

Edit 2011-01-15: In a talk by Sami Rosendahl the differences are also discussed.

Speed-comparison

We would like to present you a benchmark between OpenCL and CUDA with full comparison, but we don’t have enough hardware in-house to do a full benchmark. Below information is what we’ve found on the net and a little bit based on our own experience.

On NVidia hardware, OpenCL is up to 10% slower (see Matt Harvey’s presentation); this is mainly because OpenCL is implemented on top of CUDA-architecture (this shouldn’t be a reason, but to say NVidia has put more energy in CUDA is just a wild guess also). On ATI 4000-series OpenCL is just slow, but gives very comparable to NVidia if compared to the 5000-series. The specialised streaming processors NVidia’s Tesla and AMD’s FireStream really bite each other, while the Playstation 3 unbelievably still wins on some tasks.

The architecture AMD/ATI-hardware is very different from NVidia’s and that’s why a kernel written with a specific brand or GPU in mind just performs better than a version which is not optimised. So if you do a benchmark, it really depends on which kernels you use for it. To be more precise: any benchmark can be written in favour of a specific architecture. Fine-tuning the software to work a maximum speed in current and future(!) hardware for different kinds of datasets is (still) a specialised task for that reason. This is also one of the current problems of GPGPU, but kernel-optimisers will get better.

If you like pictures, Hugh Merz comes to the rescue, who compared CUDA-FFT against FFTW (“the fastest FFT in the West”). The page is offline now, but you it was clear that the data-transfer from and to the GPU is a huge bottleneck and Hugh Merz was rather sceptical about GPU-computing in 2007. He extended his benchmark with the PS3 and a Tesla-s1070 and now you see bigger differences. Since CPUs go multi-multi-core, you cannot tell how big this gap will be in the future; but you can tell the gap will be bigger and CPUs will more and more be programmed like GPUs (massively parallel).

What we learn from this is 1) that different devices will improve if the demands are more clear, and 2) that it will be all about specialisation, since different manufacturers will hear different demands. The latest GPUs from AMD works much better with OpenCL, the next might beat all others in a many or only specific areas in 2011 – who knows? IBM’s Cell-processor is expected to enter the ring outside the home-brew PS3 render-farms, but with what specialised product? NVidia wants to enter high in the HPC-world, and they might even win it. ARM is developing multiple-core CPUs, but will it support OpenCL for a better FLOP/Watt than competitors?

It’s all about the choices manufacturers make, which way CUDA en OpenCL will develop.

Homogeneous vs Heterogeneous

For us the most important reason to have chosen for OpenCL, even if CUDA is more mature. While CUDA only targets NVidia’s GPUs (homogeneous), OpenCL can target any digital device that has an input and an output (very heterogeneous). AMD/ATI and Intel are both on the path of making architectures that are heterogeneous; just like Systems-on-a-Chip (SoCs) based on an ARM-architecture. Watch for our upcoming article about ARM & SoCs.

While I was searching for more information about this difference, I came across a blog-item by RogueWave, which claims something different. I think they switched Intel’s architectures with NVidia’s or he knew things were going to change. In the near future could bring us an x86-chip from NVidia. This will change a lot in the field, so more about this later. They already have an ARM-chip in their Tegra mobile processor, so NVidia/CUDA still has some big bullets.

Missing language-features

Like Java and .NET are very comparable, developers from both side know very well that their favourite feature is missing at the other camp. Most time such a feature is an external library, just built in. Or is it taste? Or even a stack of soapboxes?

OpenCL has:

Task-parallel execution mode (to be used on CPUs) – not needed on NVidia’s GPUs.

CUDA has unique features too:

FFT library – so in OpenCL you need to have your own kernels for it.
~~Atomic operations – which make double-write threads easier to implement.~~
Hardware texture interpolation – OpenCL has to fall back to a larger kernel or OpenGL.
Templating – in openCL you have to create new kernels for every data-type.

In short CUDA certainly has made a lot of things just easier for the developer, but OpenCL has its potential in support for more than just GPUs. All differences are based on this difference in focus-area.

I’m pretty sure this list is not complete at all, and only explains the type of differences. So please come to the LinkedIn GPGPU Users Group to discuss this.

Last words

THIS ARTICLE IS VERY OUTDATED AND NOW SIMPLY UNTRUE FOR CERTAIN PARTS! NEW ARTICLE COMING UP.

As it is done with more shared standards, there is no win and no gain to promote it. If you promote it, a lot of companies thank you, but the Rreturn-on-Investments is lower than when you have your own standard. So OpenCL is just used-as-it-is-available, while CUDA is highly promoted; for that reason more people invest in partnerships with NVidia to use CUDA instead of non-profit organisation Khronos. And eventually CUDA-drivers can be ported to IBM’s Cell-processors or to ARM, since it is very comparable to OpenCL. It really depends on the profit NVidia will make with such deals, so who can tell what will happen.

We still think OpenCL will win eventually on consumer-markets (desktop and mobile) because of support for more devices, but CUDA will stay a big player in professional and scientific markets because of the legacy software they are currently building up and the more friendly development-support. We hope they will both exist and help each other push forward, just like OpenGL vs DirectX, nVidia vs ATI, Europe vs the USA vs Asia, etc. Time will tell what features will eventually end up in each technology.

Update August 2012: due to higher demand StreamHPC is explicitly offering CUDA to OpenCL porting.

Edit: About the multi-GPU configuration

OpenCL header files

OpenCL.lib / libOpenCL.so

Sony Catalyst Family

Colorfront Express Dailies and On-Set Dailies

RED REDCINE-X PRO

The Foundry Nuke Blink framework

Magix Hybrid Video Engine

Adobe CS6 creative suite

Sony Vegas Pro

RealFlow Hybrido2 engine

Autodesk Maya

ArcSoft SimHD and Sim3D engine

BlackMagic Design

(DaVinci) Resolve

(Eyeon) Fusion

Roxio Creator Suite

Apple Final Cut Pro and iMovie

Blender Cycles & Bullet

Side Effects Houdini

DigitalFilmTools

OTOY OctaneRender 3

SAM Alchemist XF

More?

Table of Contents

Chapter 1 OpenCL Architecture and AMD Accelerated Parallel Processing

OpenCL Booth

Schedule

Sunday, 17 November

Monday, 18 November

Tuesday, 19 November

Wednesday, 20 November

Thursday, 21 November

Khronos Members Exhibiting at SC13

At SC13 and saw great OpenCL demos or news?

GPU Manufacturers

AMD

NVIDIA

Intel

Vivante

Takumi

Imagination Technologies (ImTech)

Toshiba

S3

CPU Manufacturers

Intel

ARM

IBM

Freescale

Systems on a Chip (SoC)

3DLABS (ZiiLabs)

Movidia

Texas Instruments

Qualcomm

Apple

Samsung

Broadcom

Seaweed

ST Microelectronics

Handheld Manufacturers

Apple

Nokia

Motorola

Super-computers

IBM

Fujitsu

Los Alamos National Laboratory

Petapath

NVIDIA

Other Hardware

GE

ST-Ericsson

Software Developers

Graphic Remedy

RapidMind

HI

MascotCapsule V4 product specification

Codeplay

QNX

Fixstars