
Say you have some data that needs to be used as input for a larger kernel, but it needs a little preparation to get it aligned in memory (a small kernel with random reads). Unfortunately the efficiency of such a kernel is very low, so there is no speed-up or even a slowdown. When programming a GPU it is all about trade-offs, but one trade-off is forgotten a lot (especially by CUDA programmers) once the decision to use accelerators has been made: just use the CPU. The main problem is not the kernel, which has been optimised for the GPU, but that all supporting code (like the host code) needs to be rewritten before the CPU can be used.
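With OpenCL that switch is cheap to try: the host code selects a device type, and asking for a CPU instead of a GPU is a one-line change. Below is a minimal sketch (error handling and the actual work are omitted; it only assumes an OpenCL runtime with a CPU driver is installed):

#include <CL/cl.h>
#include <stdio.h>

int main(void) {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_device_id device;
    /* Swap CL_DEVICE_TYPE_CPU for CL_DEVICE_TYPE_GPU and the rest of
       the host code stays exactly the same. */
    cl_int err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
    printf(err == CL_SUCCESS ? "CPU device found\n" : "no CPU device available\n");
    return 0;
}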
Why use the CPU for vector-computations?
The CPU has support for computing vectors. Each core has a 256-bit wide vector unit, which means a double4 (a vector of four 64-bit floats) can be processed in one clock cycle. So a 4-core CPU at 3.5GHz goes from 3.5 billion instructions per second on a single core to 14 billion when using all 4 cores, and to 56 billion operations per second when also using the vector units. With a float8 (eight 32-bit floats) this doubles to 112 billion, and with MAD instructions (Multiply+Add, two operations in one instruction) it doubles again to 224 billion operations per second.
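Spelled out, the back-of-envelope arithmetic for this hypothetical 3.5GHz quad-core looks like this:

\[
\begin{aligned}
4 \times 3.5\,\text{GHz} &= 14\ \text{billion instructions/s} && \text{(all cores, scalar)}\\
14 \times (256/64) &= 56\ \text{billion operations/s} && \text{(double4)}\\
14 \times (256/32) &= 112\ \text{billion operations/s} && \text{(float8)}\\
112 \times 2 &= 224\ \text{billion operations/s} && \text{(MAD: 2 ops per instruction)}
\end{aligned}
\]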
Say we have this CPU with 4 cores and AVX/SSE, and the code below:
int* a = ...;
int* b = ...;
for (int i = 0; i < M; i++) {
    a[i] = b[i] * 2;   // double each element of b into a
}
How do you classify the accelerated version of the above code: a parallel computation or a vector computation? Is it an operation on an M-wide vector, or does it use M threads? The answer is both – vector computations are a subset of parallel computations, so vector computations can be run in parallel threads too. This is interesting, as it means the code can run both on the AVX units and on the various cores.
If you have written the above code, you’d secretly hope the compiler figures this out and automatically runs it on all hyper-threaded cores and all the vector extensions it has. To have the code make use of the separate cores, you have various options, like plain threads or OpenMP/MPI. To make use of the vectors (which increases speed dramatically), you need a vector-enabled programming language like OpenCL.
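As an illustration, here is a minimal OpenCL-C kernel for the same loop (the kernel name is made up, and it assumes M is a multiple of 4 with one work-item launched per int4):

/* Each work-item doubles four elements at once via the int4 vector type. */
__kernel void double_elements(__global int4* a, __global const int4* b) {
    int i = get_global_id(0);   /* work-item index; launch M/4 work-items */
    a[i] = b[i] * 2;            /* one vector multiply instead of four scalar ones */
}

An OpenCL CPU driver maps the work-items onto the cores and the int4 arithmetic onto the SSE/AVX registers, so the same kernel exploits both levels of parallelism.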
To learn more about the difference between vector and parallel code, read the series on programming theories, read my first article on OpenCL-CPU, look around this site (over 100 articles and a growing knowledge section), ask us a direct question, use the comments, or help make this blog tick: request a full training and/or code review.


During the “little” HPC show SC12, several vendors launched some very impressive products. The question is: who steals the show from whom? Intel finally got their Xeon Phi processor launched, NVIDIA came with the Tesla K20 and K20X, and AMD introduced the FirePro S10000.
Recently AMD announced their new FirePro GPUs for use in servers: the S9000 and the S7000. They use passive cooling, as server racks are actively cooled already.









Developing with OpenCL is fun, if you like debugging. Having software with support for OpenCL is even more fun, because no debugging is needed. But what would make a good machine? Below is an overview of the kinds of hardware you have to think about; it is not in-depth, but it gives you enough information to make a decision in your local or online computer store.




Computer games are cool, if only because you can choose from so many different kinds. While Tetris will live forever, the latest games also have something to add: realistic physics simulation. And that is now done by GPUs. Nintendo has shown us that gameplay and good interaction are far more important than video quality. The wow-factor of photo-realistic real-time rendering is not what it was years ago.


