Random Numbers in Parallel Computing: Generation and Reproducibility (Part 1)

Random numbers are important elements in stochastic simulations, but they also show up in machine learning and in applications of Monte Carlo methods, such as in computational finance, fluid dynamics and molecular dynamics. These are classical fields in high-performance computing, in which StreamHPC has experience.

A common problem when porting traditional software in these fields to parallel computing environments is the generation and reproducibility of random numbers. Questions that arise are:

  • Performance: How can we efficiently obtain random numbers when they are classically generated in a serial fashion?
  • Quality: How can we make sure that random numbers generated in a parallel environment still fulfil statistical randomness requirements?
  • Verification: How can we be sure that the parallel implementation is correct?

We consider verification from the viewpoint of producing identical results among different software implementations. This is often an important matter for our customers, and we have given them guidance on how to address this issue when random numbers are involved.

In this first part of our two-part blog series, we will briefly address some common pitfalls in the generation of random numbers in parallel environments and suggest suitable random-number generation libraries for OpenCL and CUDA. In the second part – on the blog soon – we will discuss solutions for reproducibility in the presence of random numbers.

Generation

Random numbers in computer software are typically obtained via a deterministic pseudo-random number generator (PRNG) algorithm. The output of such an algorithm is not truly random but pseudo-random (i.e., it appears statistically random), though we will simply say “random” for simplicity. We do not consider truly random numbers, which may be derived from physical phenomena such as radioactive decay, because we want the output of a random number generator to be reproducible.

PRNGs traditionally offered to application developers fail in the parallel setting. One reason is that these algorithms usually only support the sequential generation of random numbers from some initial (seed) value (e.g., consider the standard C rand() function), so work items on a parallel device would need to block to get exclusive access to the generator, which clearly hurts efficiency.

Some applications may require only a moderate amount of random numbers. In this case, we found it feasible to precompute the required set of random numbers and hold them in global memory. We call this the table-based approach.

Other applications in turn may need to efficiently create a huge amount of random numbers. In this case, it may be necessary to equip each work item with its own PRNG seed. One potential problem with this approach is the use of weak PRNGs such as linear congruential generators (LCGs), which remain popular due to their speed and simplicity. In parallel settings, correlations between output sequences are aggravated and the quality of the application output may be severely affected, so LCGs should not be used at all.

Another problem is the use of a small seed or of a PRNG with a small internal state space. In this case, we may expect that the probability of two work items creating the same random sequence is quite high. Indeed, if we seeded randomly via srand(), whose seed is typically only 32 bits wide, the chance would already be 50% that two out of approximately 77,000 work items create exactly the same random number sequence! So we may either need a PRNG with a larger seed space and internal state, or one with a larger state and some mechanism to subslice the PRNG’s output sequence into non-overlapping “substreams”, with one substream per work item. The Mersenne Twister is highly acclaimed but requires a memory state of approximately 2.5 KB per work item in a parallel setting, and substreams are difficult to implement.

While good PRNGs with a small internal state and flexible substream support exist (e.g., MRG32k3a), there are also “index-based” PRNGs, which are often more elaborate to compute but do not maintain any state. Such stateless PRNGs take an arbitrary index and a “key” as input and return the random number at that index of the output sequence selected by the key. Index-based PRNGs are very useful in parallel computing environments, and we will show how we use them for reproducibility in the second part of this blog.
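The figure of roughly 77,000 work items follows from the birthday problem. Below is a minimal sketch of that calculation in Python, assuming a 32-bit seed space (2^32 possible seeds) and that every work item draws its seed uniformly at random. The function names are ours and purely illustrative.

```python
import math

def collision_probability(n_items: int, seed_bits: int = 32) -> float:
    """Approximate probability that at least two of n_items share a seed.

    Standard birthday-problem approximation:
    p ~= 1 - exp(-n * (n - 1) / (2 * N)), with N = 2**seed_bits.
    """
    n_seeds = 2 ** seed_bits
    exponent = n_items * (n_items - 1) / (2.0 * n_seeds)
    return -math.expm1(-exponent)  # 1 - exp(-x), accurate for tiny x

def items_for_probability(p: float, seed_bits: int = 32) -> int:
    """Approximate number of work items at which the collision probability reaches p."""
    n_seeds = 2 ** seed_bits
    return math.ceil(math.sqrt(2.0 * n_seeds * math.log(1.0 / (1.0 - p))))

if __name__ == "__main__":
    print(items_for_probability(0.5))         # work items needed for a 50% chance
    print(collision_probability(77_000))      # close to 0.5 for a 32-bit seed
    print(collision_probability(77_000, 64))  # negligible for a 64-bit seed
```

The last line illustrates why a larger seed space helps: with the same number of work items, a 64-bit seed makes an identical sequence on two work items extremely unlikely.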

The choice of an appropriate PRNG may not be easy and ultimately depends on the application scenario. Luckily, there is choice! CUDA offers a set of PRNGs via its cuRAND library, and OpenCL applications can benefit from the clRNG library that AMD released last year. Both cuRAND and clRNG offer a state-based interface with substream support. For index-based algorithms, the Random123 library provides high-quality PRNG implementations for both OpenCL and CUDA.
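To make the index-based idea concrete, here is a toy sketch in Python. It is not one of the Random123 algorithms and has not been tested for statistical quality; it only shows the interface: a pure function of (key, index) with no state. The mixing function is a SplitMix64-style finalizer, and all names are our own.

```python
MASK64 = (1 << 64) - 1

def _mix64(z: int) -> int:
    """SplitMix64-style finalizer: scrambles a 64-bit integer."""
    z = (z + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)

def indexed_uniform(key: int, index: int) -> float:
    """Return the index-th number in [0, 1) of the stream selected by key.

    No state is kept: the result depends only on (key, index).
    """
    bits = _mix64(_mix64(key & MASK64) ^ (index & MASK64))
    return (bits >> 11) * 2.0 ** -53  # keep the top 53 bits for a double

if __name__ == "__main__":
    key = 2024  # hypothetical key, e.g. one per simulation run
    forward = [indexed_uniform(key, i) for i in range(8)]
    # "Work items" evaluated in a different order produce the same numbers.
    shuffled = {i: indexed_uniform(key, i) for i in (5, 2, 7, 0, 3, 6, 1, 4)}
    assert forward == [shuffled[i] for i in range(8)]
    print(forward)
```

Because each number depends only on its index, the order in which work items run no longer matters, which is the property that makes this family of generators attractive for reproducibility.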

So far, we have discussed how we can safely generate random numbers in the GPU and FPGA context, but we cannot control the order in which parallel, concurrent work items create random numbers. This makes it difficult to verify the parallel implementation since its output may be different from that of the serial, original code. So the question is, in the presence of random numbers, how can we easily verify that our parallel code implements not only a faithful but a correct port of the serial version? This is addressed in part two – continue reading.

The most noticeable processors from NVIDIA, AMD and Intel

10 years ago we had CPUs from Intel and AMD and GPUs from ATI and NVidia. There was even another CPU-maker, VIA, and there were the GPU-makers S3 and Matrox. Things are different now. Below I want to briefly discuss the most noticeable processors from each of the big three.

The reason for this blog-post is that many processors are relatively unknown, and several problems are therefore solved inefficiently. 

NVidia

As NVidia doesn’t have X86, they mostly focus on GPUs and bet on POWER and ARM for CPUs. They already sell their Pascal-architecture in small numbers.

2017 will be all about their Pascal-architecture.

Tesla K80 (Kepler)

  • The GPU is not simply 2 x K40 (GK110B GPUs); the chip is actually different (GK210).
  • It is the NVidia GPU with the largest private memory size (used in kernels): 255 registers per thread.

This is the GPU for lazy programmers and for actually complex code: kernels can use double the registers.

Tesla P100 (Pascal)

  • 20 TFLOPS Half Precision (HP), 10 TFLOPS single precision, 5 TFLOPS double precision
  • 16 GB HBM2 (720 GB/s).
  • NVLink up to 64 GB/s effectively (20% of the 80 GB/s is protocol overhead), dual simplex bidirectional (so dedicated wires per direction). Each NVLink offers 16 GB/s up and 16 GB/s down. Compared to the 12 GB/s of PCIe3 x16 (24 GB/s cumulative), this is a good speed-up. The support is only available between Pascal-GPUs, and not yet between the GPU and CPU.
  • OpenPOWER support coming, to compete with Intel.
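The NVLink numbers in the list above fit together as simple arithmetic. The short Python sketch below only uses the figures quoted in the list; the number of links is not stated there and is merely what these figures imply.

```python
RAW_NVLINK_GBS = 80.0      # raw NVLink bandwidth quoted above
PROTOCOL_OVERHEAD = 0.20   # share lost to protocol overhead
PER_LINK_GBS = 16.0        # effective bandwidth per link and direction
PCIE3_X16_GBS = 12.0       # PCIe3 x16 figure quoted above

effective_gbs = RAW_NVLINK_GBS * (1.0 - PROTOCOL_OVERHEAD)
implied_links = effective_gbs / PER_LINK_GBS    # implied by the figures, not stated in the text
speedup_vs_pcie = effective_gbs / PCIE3_X16_GBS

if __name__ == "__main__":
    print(f"effective NVLink bandwidth: {effective_gbs:.0f} GB/s")
    print(f"implied number of links:    {implied_links:.0f}")
    print(f"speed-up over PCIe3 x16:    {speedup_vs_pcie:.1f}x")
```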

Currently only available in a server costing $129,000 with 8 of these (putting the price of each P100 at about $15,000). It will probably be widely available somewhere in Q1 2017, when HBM2 production is up to speed. It is unknown what the price will be then – that depends on how many companies are willing to pay the high price now.

The GPU is perfect for deep learning, which NVidia is highly focused on. The 5 TFLOPS double precision is very interesting too. A server with 8 GPUs gives you 80 TFLOPS single precision – double that, if you only need Half Precision.

Titan Black (Kepler) and GTX 980 (Maxwell)

  • The Titan Black has 1.7 TFLOPS DP, 4.5 TFLOPS SP.
  • The GTX 980 has 0.14 TFLOPS DP, 4.6 TFLOPS SP.

These are the two best-selling GPUs from NVidia that are not server-grade. What is interesting to note is that the GTX 980 is not always faster than the Titan Black, even though it’s more recent.

Tegra X1

  • 0.5 TFLOPS SP (GPU), 1 TFLOPS HP
  • 10 Watts

While not well-accepted in the car industry (it uses too much power and has no OpenCL), it is well-accepted in the car-entertainment industry.

AMD

Known for having the strongest OpenCL-developers since 2012. With the HSA-capable Fiji-GPUs, they have now reached their third GPGPU-architecture after “VLIW” and “GCN” – fully driven by their HSA-initiative.

For 2017 they focus on their main advantages: brute Single Precision performance, HBM (they have early access), their new CPU (Zen) and new GPU (Polaris).

FirePro S9170 (GCN)

  • 32GB GDDR5 global memory
  • 2.5 TFLOPS DP, 5 TFLOPS SP

The GPU’s processor is the same as the FirePro S9150, which has been the unknown best DP-performer of the past years. The GPU got the top 1 spot using air-cooled solutions, only to be surpassed by oil-submersed solutions. The S9170 builds on top of this and adds an extra 16GB of memory.

The S9170 is the GPU with the largest amount of memory, solving problems that use a lot of memory and are bandwidth limited – think calculations in oil & gas and weather, which currently don’t fit on GPUs.

Radeon Nano and FirePro S9300X2 (Fiji)

  • Nano: 0.8 TFLOPS DP, 8 TFLOPS SP, no HP-support at the processor (only for data-transfers)
  • S9300X2: 1.4 TFLOPS DP, 13.9 TFLOPS SP (lower clocked)
  • Nano 175 Watt, S9300X2 300 Watt
  • Nano has 4 GB HBM, with a bandwidth up to 512GB/s, S9300X2 has 2x 4GB HBM.

The Nano is the answer to NVidia’s Titans, and the S9300X2 is its server-class version.

These GPUs bring the best SP-GFLOPS/€ and the best SP-GFLOPS/Watt as of now. The Nano focuses on VR desktops, whereas the S9300X2 enables you to put up to 111 TFLOPS in one server.

AMD Carrizo A10 8890k APU (HSA)

  • CPU with built-in GPU
  • About one TFLOPS
  • TDP of 95 Watt

The fastest HSA-capable processor out there. This means that complex software that needs a mix of task-parallel and data-parallel code runs best on such a processor. This CPU+GPU has the most TFLOPS available on the market.

Intel

After years of “Peter and the wolf” stories, they seem to finally have gotten the Larrabee they promised years ago. With the acquisition of Altera, new processors are on the horizon.

Their focus is still on customers who focus on test-driven design and want to “make it run quickly, make it perform later”.

Xeon E5-2699 v4

  • 55MB cache, 22 cores
  • AVX 2.0 (256 bit vector operations)
  • DDR4 (60 GB/s)

Not well-known, but this CPU is very capable of running complex HPC-code for the price of a high-end GPU. It could reach about 0.64 TFLOPS DP peak, when fully using all cores and AVX 2.0.
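An estimate of this size can be reproduced with the usual peak-performance formula: cores × clock × floating-point operations per cycle per core. The Python sketch below is a rough check only. It assumes two FMA units per core and a sustained clock of about 1.8 GHz under AVX load; neither figure is given in the text above, and the function name is ours.

```python
def peak_gflops(cores: int, clock_ghz: float, vector_bits: int,
                element_bits: int = 64, fma_units: int = 2) -> float:
    """Theoretical peak in GFLOPS.

    lanes           = values per vector register (256 / 64 = 4 doubles for AVX 2.0)
    flops per cycle = lanes * 2 (an FMA counts as multiply + add) * fma_units
    """
    lanes = vector_bits // element_bits
    flops_per_cycle = lanes * 2 * fma_units
    return cores * clock_ghz * flops_per_cycle

if __name__ == "__main__":
    # 22 cores and 256-bit AVX 2.0 come from the list above;
    # the 1.8 GHz clock and the two FMA units are assumptions.
    dp = peak_gflops(cores=22, clock_ghz=1.8, vector_bits=256)
    print(f"{dp / 1000:.2f} TFLOPS DP")
```

The same function with element_bits=32 gives the single-precision peak, which is twice as high because twice as many values fit in one vector register.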

Xeon Phi Knights Landing

  • Available in socket and PCI version
  • 3 TFLOPS DP, 6 TFLOPS SP
  • AVX 512 (512 bit vector operations)
  • 16 GB HBM (over 400GB/s), up to 384 GB DDR4 (60 GB/s).
  • Currently (?) not programmable with OpenCL

After years of so-so Xeon Phis, it seems Intel now has a processor that competes with AMD and NVidia. Existing code (almost) just works on this processor, and can then be improved step-by-step. The only thing not to like is the lack of benchmarks – so the above numbers are all on paper.

Xeon+FPGA

  • Task-parallel processor
  • Low-latency

The reconfigurable chip that has been promised for over 2 decades.

I’m still researching this upcoming processor, as one of the strengths of an FPGA is the low-latency links to DisplayPort and networking, which seem to go via PCI on this processor.

Iris GPUs

  • CPU with built-in GPU
  • 0.7 TFLOPS SP

As these GPUs are included in almost all CPUs that Intel sells, these are the most-sold GPUs.

Selecting the right hardware

Choosing the best hardware has become quite complex, especially when focusing on the TCO (Total Cost of Ownership). At StreamHPC we have experience with many of the devices above, but also with various embedded hardware that competes with the above processors on a totally different scale. You need to select the right benchmarks to know what your device of choice is – we can help with that.

Developing AMD-based HPC software

With their ROCm initiative AMD re-entered the HPC market in full glory. StreamHPC is developing trainings, where we get into the details of how to build HPC-software for AMD hardware.

During the training we focus on the following parts:

  • Porting existing CUDA software to HIP, which runs without slowdown on both AMD and NVidia GPUs.
  • Optimising OpenMP and OpenACC to run at maximum performance on AMD GPUs.
  • Building C++-based software from scratch, fast, using HCC.

Trainings will be ready by September.

Dear Linux-users, during the transition period for FGLRX to AMDGPU/ROCm there’s no kernel 4.4 or Xorg 1.18 support

The information you find everywhere: on Linux the current “radeon” and “fglrx” are being replaced by AMDGPU (graphics) and ROCm (compute) for HSA-enabled GPUs. As the whole AMD Linux driver team is seemingly working on getting the new, open source drivers ready, fglrx is now deprecated and will not get updates (or only very late). I can therefore get to the point:

When using fglrx on Linux, don’t upgrade to Linux distributions with a kernel later than 4.2 or Xorg server versions beyond 1.17!

For Ubuntu this means no 14.04.5 or 16.04 or later. When you have 14.04.4, the kernel will not upgrade when you go to 14.04.5. CentOS/RedHat have such old kernels that there currently is no issue. Fedora users simply have a problem, as they are already moving towards 4.8.
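The rule above (kernel no later than 4.2, Xorg server no later than 1.17) can be turned into a small pre-upgrade check. The Python sketch below is illustrative only: the function names are ours, and the version strings are passed in by hand because the way to query the Xorg version differs per distribution.

```python
import re

MAX_KERNEL = (4, 2)  # last kernel series that works with fglrx, per the text above
MAX_XORG = (1, 17)   # last Xorg server series that works with fglrx

def major_minor(version: str) -> tuple:
    """Extract (major, minor) from strings such as '4.4.0-21-generic' or '1.18.3'."""
    match = re.match(r"(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return int(match.group(1)), int(match.group(2))

def fglrx_supported(kernel: str, xorg: str) -> bool:
    """True if both the kernel and the Xorg server are old enough for fglrx."""
    return major_minor(kernel) <= MAX_KERNEL and major_minor(xorg) <= MAX_XORG

if __name__ == "__main__":
    import platform
    # platform.release() returns the running kernel version on Linux.
    print("running kernel:", platform.release())
    print(fglrx_supported("4.2.0-42-generic", "1.17.2"))  # example version strings
    print(fglrx_supported("4.4.0-21-generic", "1.18.3"))
```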


Install (Intel) Altera Quartus 16.0.2 OpenCL on Ubuntu 14.04 Linux

To temporarily increase capacity we put Quartus 16.0.2 on an Ubuntu server, which did not go smoothly – but at least more smoothly than upgrading packages to the required versions on RedHat/CentOS. While the download says “Linux” and you’re expecting support for multiple Linux breeds, there is only official support for RedHat 6.5 (and CentOS).

Luckily it turned out to be entirely possible to get a stable installation of Quartus on Ubuntu. As information on this subject was scattered around the net and even incomplete, we decided to share our howto in this blogpost. These tips probably also work for other modern Linux-based operating systems like Fedora, Suse, Arch, etc., as most problems are due to new features and more up-to-date libraries than those provided in RedHat/CentOS.

Note 1: we did not install the FPGA on the Ubuntu machine, nor did we fully research potential problems with doing so – installing the FPGA on an Ubuntu machine is at your own risk. Have your board maker follow this tutorial to test their libraries on Ubuntu.

Note 2: we tested on Ubuntu 14.04. No guarantees that it all works on other versions. Let us know in the comments if it works on other versions too.

The Fastest Payroll System Of The World

At StreamHPC we do several very different types of projects, but this project has been very, very different. In the first place, it was nowhere close to scientific simulation or media processing. Our client, Intersoft solutions, asked us to speed up thousands of payroll calculations on a GPU.

They wanted to solve a simple problem, avoiding slow conversations with HR of large companies:

Yes, I can answer your questions.

For that I need to do a test-run.

Please come back tomorrow.

The calculation of 1600 payslips took one hour. This means 10,000 employees would take over 6 hours. Potential customers appreciated the clear advantages of Intersoft’s solution, but told us that they were searching for a faster solution in the first place.

Using our accelerated compute engine, a run with 3300 employees (anonymised, real data) now only takes 20 seconds, including loading and writing all data to the database – a speedup of about 250 times. A run with 100k employees can be completed in under 2 minutes – the above HR department would have liked that.


Double the performance on AMD Catalyst by tweaking subgroup operations

In standard OpenCL 2.0, less than half of AMD’s hardware was actually used for scan operations.

OpenCL 2.0 added several new built-in functions that operate on a work-group level. These include functions that work within sub-groups (also known as warps or wavefronts). The work-group functions perform basic parallel patterns for whole work-groups or sub-groups.

The most important ones are reduce and scan operations. Those patterns have been used in a lot of OpenCL software and can now be implemented in a more straightforward way. The promise to developers was that vendors can now provide better performance using no or very little local memory. However, the promised performance wasn’t there from the beginning.
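For readers new to these patterns, the Python sketch below models what a reduce and a scan compute for one work-group, with one input value per work-item. It is a serial model of the semantics only, not of how a GPU executes them. The function names mirror the OpenCL 2.0 built-ins work_group_reduce_add, work_group_scan_inclusive_add and work_group_scan_exclusive_add.

```python
from functools import reduce
from itertools import accumulate
import operator

def work_group_reduce_add(values):
    """Every work-item receives the same result: the sum over the whole group."""
    return reduce(operator.add, values, 0)

def work_group_scan_inclusive_add(values):
    """Work-item i receives the sum of values[0..i], including its own value."""
    return list(accumulate(values, operator.add))

def work_group_scan_exclusive_add(values):
    """Work-item i receives the sum of values[0..i-1]; work-item 0 receives 0."""
    if not values:
        return []
    return [0] + work_group_scan_inclusive_add(values)[:-1]

if __name__ == "__main__":
    group = [3, 1, 4, 1, 5]  # one value per work-item
    print(work_group_reduce_add(group))          # 14
    print(work_group_scan_inclusive_add(group))  # [3, 4, 8, 9, 14]
    print(work_group_scan_exclusive_add(group))  # [0, 3, 4, 8, 9]
```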

Recently, at StreamHPC we worked on improving the performance of certain OpenCL kernels running specifically on AMD GPUs, where we needed OpenGL-interop and thus chose the Catalyst-drivers. It turned out that work-group and sub-group functions did not give the expected performance on either Windows or Linux.

StreamComputing is 7 years!

As of 1 April we are 7 years old. Because of all the jokes on that day, this post is a bit later.

Let me take you through the journey of how we grew from a 1-person company to what we are now. With pride I can say that (with ups and downs) StreamComputing (now rebranded to StreamHPC) has become a brand that stands for (extremely) fast software, HPC, GPUs and OpenCL.

7 years of changes

Different services

After 7 years it’s also time for changes. Initially we solely worked on OpenCL related services, mostly GPUs. And this is what we’re currently doing:

  • HPC GPU computing: OpenCL, CUDA, ROCm.
  • Embedded GPU computing: OpenCL, CUDA, RenderScript, Metal.
  • Networked FPGA programming: OpenCL.
  • GPU-drivers testing and optimisation.
  • Software architecture optimisations.

While you see OpenCL a lot, our expertise in vendor-specific CUDA (NVidia), ROCm (AMD), RenderScript (Google) and Metal (Apple) cannot be ignored. Hence the “Performance Engineers” and not “GPU consultants” or “OpenCL programmers”.

From Fixers to Builders and getting new competition

Another change is that we have been going from fixing code afterwards to building software.

This has been a slow process and had to do with the confidence in performance engineering as an expert profession instead of a trick. We’re seeing new companies coming into the market and providing GPU-computing next to their usual services. This is a sign of the market growing up.

We’re confident in growing further in our market, as we have the expertise to design fast software while the newcomers have gained expertise to write code that runs on the GPU with only little speedup.

Community: OpenCL:PRO to OpenCL.org

There have been several times when we wanted to support the community more. The first try was OpenCL:PRO, which did not live long, as it was actually unclear to us what “the community” wanted.

In the end it was not that hard. Everybody who starts with OpenCL has the same problems:

  • Lack of convenience code, resulting in many, many wrappers and libraries that are incompatible.
  • Lack of practice projects.
  • Lack of overview on what’s available.

With OpenCL.org we aim to solve these problems together with the community. All is shared on Github and anybody can join to complete the information we’ve shared. While our homepage had around 40 pages on these subjects, it was only our personal view on the subjects or had outdated info.

So we’re going to donate most of the OpenCL-related technical pages we’ve written over the years to the community.

There is much more to share – watch our blog, the OpenCLorg twitter and newsletter!

Different Logo

For those who remember: in 2010 the logo looked quite different. We still use the blocks in the background (like on our Twitter account), but since 2014 the colours and font are quite different. This change has been going along with the company growing up. The old logo is careful, while the new one is bold – now we’re more confident about our expertise and value.

Over the past 3 years the new logo has stayed the same and has fully become our identity.

Same kind of customers

It has been quite a journey! We could not have done it without all the customers we served over those 7 years.

Thank you!

Khronos Releases OpenCL 2.2 With SPIR-V 1.2

Today Khronos has released OpenCL 2.2 with SPIR-V 1.2.

The most important changes are:

  • A static subset of the C++14 standard as a kernel language. The OpenCL C++ kernel language includes classes, templates, lambda expressions, function overloads and many other constructs to increase parallel programming productivity through generic and meta-programming.
  • Access to the C++ language from OpenCL library functions to provide increased safety and reduced undefined behavior while accessing features such as atomics, iterators, images, samplers, pipes, and device queue built-in types and address spaces.
  • Pipe storage, which are compile time pipes. It’s a device-side type in OpenCL 2.2 that is useful for FPGA implementations to enable efficient device-scope communication between kernels.
  • Enhanced optimization of generated SPIR-V code. Applications can provide the value of specialization constants at SPIR-V compilation time, a new query can detect non-trivial constructors and destructors of program scope global objects, and user callbacks can be set at program release time.
  • KhronosGroup/OpenCL-Headers repository has been flattened. From now on, all versions of the OpenCL headers will be available not in separate branches, but all in the master branch in separate directories named opencl10, opencl11 etc. Old branches are not removed, but they may not be updated in the future.
  • OpenCL specifications are now open source. OpenCL Working Group decided to publish sources of recent OpenCL specifications on GitHub, including just released OpenCL 2.2 and OpenCL C++ specifications. If you find any mistake, you can create an appropriate merge request fixing that problem.

This is what we said about the release:

“We are very excited and happy to see OpenCL C++ kernel language being a part of the OpenCL standard,” said Vincent Hindriksen, founder and managing director of StreamHPC. “It’s a great achievement, and it shows that OpenCL keeps progressing. After developing conformance tests for OpenCL 2.2 and helping finalizing OpenCL C++ specification, we are looking forward to work on first projects with OpenCL 2.2 and the new kernel language. My team believes that using OpenCL C++ instead of OpenCL C will result in improved software quality, reduced maintenance effort and faster time to market. We expect SPIR-V to heavily impact the compiler ecosystem and bring several new OpenCL kernel languages.”


What is OpenCL?

OpenCL (trademark of Apple Computers Inc.) is an open, royalty-free industry standard that makes much faster computations possible. The standard is controlled by non-profit standards organisation Khronos. By using this technique and graphics cards (GPUs) or extensions of modern processors you can for example convert a video in 20 minutes instead of 2 hours.

Programming the GPU was a very difficult task done by specialised teams and universities, but since 2010 it is in reach of more companies.

Below is a video which explains the differences between single-core, multiple core (starting at 1:27) and OpenCL (starting at 2:32).

http://www.youtube.com/watch?v=IEWGTpsFtt8

You can read more about the engineering ins and outs of the standard at http://www.khronos.org/opencl/.

How OpenCL works

OpenCL is an extension to existing languages. It makes it possible to specify a piece of code that is executed multiple times independently from each other. This code can run on various processors – not only the main one. Also there is an extension for vectors (float2, short4, int8, long16, etc), because modern processors have support for that.

Say, for example, that you need to calculate sin(x) for a large array of one million numbers. OpenCL detects which devices could compute this for you and gives some statistics of each device. You can pick the best device, or even several devices, and send the data to the device(s). Normally you would loop over the million numbers, but now you say something like: “Get me sin(x) of each x in array A”. When finished, you take the data back from the device(s) and you are done.
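The idea of “one small piece of code, executed once per element” can be sketched in plain Python. This is a conceptual model only: it does not use OpenCL, the enqueue function is a made-up stand-in for launching a kernel, and a real device would run the calls in parallel and in no guaranteed order.

```python
import math

def sin_kernel(gid, a, out):
    """Kernel body: runs once per element; gid plays the role of get_global_id(0)."""
    out[gid] = math.sin(a[gid])

def enqueue(kernel, global_size, *args):
    """Hypothetical stand-in for a kernel launch: here the work-items run one after another."""
    for gid in range(global_size):
        kernel(gid, *args)

if __name__ == "__main__":
    n = 1_000_000
    a = [i * 1e-6 for i in range(n)]  # input array A
    out = [0.0] * n                   # result buffer
    enqueue(sin_kernel, n, a, out)    # "get me sin(x) of each x in array A"
    print(out[:3])
```

Note that the kernel body contains no loop: the loop over the million elements is replaced by the launch itself, which is what allows a device to spread the work over its many processing elements.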

As the compute-devices can do more in parallel and OpenCL is better at describing independent functions, the total execution time is much lower than with conventional methods.

5 questions on OpenCL

Q: Why is it so fast?
A: Because a lot of extra hands make less work, the hundreds of little processors on a graphics card being the extra hands. But cooperation with the main processor keeps being important to achieve maximum output.

Q: Does it work on any type of hardware?
A: As it is an open standard, it can work on any type of hardware that targets parallel execution. This can be a CPU, GPU, DSP or FPGA.

Q: How does it compare to OpenMP/MPI?
A: Where OpenMP and MPI try to split loops over threads/servers and are CPU-oriented, OpenCL focuses on making threads data-position aware and on making use of processor-capabilities. There are several efforts to combine the two worlds.

Q: Does it replace C or C++?
A: No, it is an extension which integrates well with C, C++, Python, Java and more.

Q: How stable/mature is OpenCL?
A: Currently we have reached version 1.2 and the standard is 3 years old. OpenCL has many predecessors, so the technology is considerably older than 3 years.

Nokia Maemo and OpenCL

Update 21-06-2011: Bumped into a project by Nokia: CLEP, “OpenCL Embedded Profile” for the N900.

Maemo is the Debian based Linux-distribution of Nokia for embedded devices. It is on the gadget N900, so you can be root on your own phone and compile your own kernel. In other words: a great developer’s phone.

Which smartphone to buy when you want to toy around with OpenCL “Embedded Profile”? There is more and more evidence that the next iPhone OS will have support for OpenCL, as should be expected with Apple being the trademark-owner of OpenCL. This is good, since the mobile market could make the difference for the technique – competing with CUDA and DirectCompute. “The other ARM Cortex-A8 smartphone”, the Nokia N900, does not support it, while the magic of OpenCL attracts many developers on the Maemo-forums.

The QT-blog that disclosed coming OpenCL-support for QT, spoke about it too:

>>Right now, QtOpenCL works very well with desktop OpenCL implementations, like that from NVIDIA (we’ve tested it under Linux, Mac, and Windows). Embedded devices are currently another matter – OpenCL implementations are still very basic in that space.  The performance improvements on embedded CPU’s are only slightly better than using ARM/NEON instructions for example.  And embedded GPU’s are usually hard-wired for GLSL/ES, lacking many of the features that makes OpenCL really sing.  But like everything in the embedded space, things are likely to change very quickly. By releasing QtOpenCL, hopefully we can stimulate the embedded vendors to accelerate development by giving them something to test with. Be the first embedded device on the block to get the mandelbrot demo running at 10fps, or 20fps, or 60fps!<<

But checking the whole Nokia QT/Maemo-SDK for something like “opencl.h” or words like “opencl” and “khronos” in .h-files did not return anything interesting. The missing reference in the SDK tells me that we cannot expect any OpenCL-implementation on the N900 soon. So do we have to wait for the Nokia N920, Maemo 6 and QT 4.8? Once I know more, by getting deeper into the SDK, you’ll be the first to know. But first let me show you the documents which tell us OpenCL is coming to the Maemo-platform.
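A search like the one described above can be scripted in a few lines. The Python sketch below is not the procedure that was actually used for this post; it simply walks a directory tree and reports header files whose name or content mentions one of the search words. The function name and the command-line handling are our own.

```python
import os

def find_mentions(sdk_root, words=("opencl", "khronos"), suffix=".h"):
    """Return (path, line number, text) for every header that mentions one of the words.

    A line number of 0 means the file name itself matched (e.g. opencl.h).
    """
    hits = []
    for folder, _dirs, files in os.walk(sdk_root):
        for name in files:
            if not name.lower().endswith(suffix):
                continue
            path = os.path.join(folder, name)
            if any(word in name.lower() for word in words):
                hits.append((path, 0, name))
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if any(word in line.lower() for word in words):
                        hits.append((path, number, line.rstrip()))
    return hits

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:  # path to the unpacked SDK, supplied by the user
        for path, number, text in find_mentions(sys.argv[1]):
            print(f"{path}:{number}: {text}")
```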

The Maemo Base Port Document, version 1.1

Exhibit number 1. The introduction tells us that the document describes what hardware-designers should do to get Maemo working on their device:

>>When Maemo is ported to a new chipset and HW environment, the majority of the SW work takes place in the base layer. However, some adjustments may also be needed in the other layers. The porting work as a whole is a combined effort by the chipset vendor and Nokia. This document describes the deliverables expected from the chipset vendor in such an effort. The requirements in this document are expressed in the form of SW component, interface and functional requirements. Note that in many cases more detailed discussions are needed between Nokia and the chipset vendor to reach a common understanding about the specifics of the system architecture and the required component versions, functionality and interfaces.<<

So the document describes what the hardware must support, to be able to run Maemo. Let’s then find the magic word “OpenCL”:

>>Graphics Adaptation. The Base Port graphics adaptation interfaces consist of X11, OpenGL ES, and OpenVG interfaces. The OpenCL interface is also included in this group since it typically is used to access the GPU for general-purpose parallel computation.<<

And somewhat below:

>>OpenCL 1. The Base Port should provide an implementation of the OpenCL 1.0 interface for general-purpose parallel programming of heterogeneous systems, especially for the use of GPUs for computation (Khronos group standard).<<

That makes it pretty clear that Maemo-devices must be able to support OpenCL.

http://www.forum.nokia.com/piazza/wiki/images/7/7d/Maemo_Base_Port_v1.1.pdf

Paper “OpenCL on Embedded devices” by Nokia

Exhibit 2 shows tests of a few simple OpenCL-programs on an unnamed device with a TI OMAP 3430 (550 MHz ARM Cortex-A8 CPU & 110 MHz POWERVR SGX530 GPU) – which happens to be in the Motorola Droid, Palm Pre, and Nokia N900. So they managed to create an OpenCL-implementation on ARM. If you’re interested in OpenCL for embedded devices, please do read this presentation:

http://www.khronos.org/developers/library/2009-hotchips/Nokia_OpenCL-in-Handheld-Devices.pdf

It is a document from August 2009, which shows they actually were trying POWERVR and OpenCL then. Now with QT and Maemo mentioning it, we can be very sure the N900 or the N920 is eventually going to have OpenCL-support.

OpenCL on the CPU: AVX and SSE

When AMD came out with CPU-support I was the last one to be enthusiastic about it, comparing it to feeding chicken-food to oxen. Now CUDA has CPU-support too, so what was I missing?

This article is a quick overview of OpenCL on CPU-extensions, but expect more to come when the Hybrid X86-Processors actually hit the market. Besides ARM, IBM already has them too; more about their POWER-architecture follows in an upcoming article, to give it the attention it deserves.

CPU extensions

SSE/MMX started in the 90’s, extending the IBM-compatible X86 instruction set and being able to do an add and a multiplication in one clock-tick. I still remember the discussion in my student-flat that the MP3s I could produce in only 4 minutes on my 166MHz PC just had to be of worse quality than the ones which were encoded in 15 minutes. No, the encoder I “found” on the internet made use of SSE-capabilities. Currently we have reached SSE5 (by AMD) and Intel introduced a new extension called AVX. That’s a lot of abbreviations! MMX stands for “MultiMedia Extension”, SSE for “Streaming SIMD Extensions” with SIMD being “Single Instruction Multiple Data” and AVX for “Advanced Vector Extension”. This actually sounds very interesting, since we saw SIMD and vectors on the GPU too. Let’s go into SSE (1 to 4) and AVX – both fully supported on the new CPUs by AMD and Intel.

Continue reading “OpenCL on the CPU: AVX and SSE”

Phoronix OpenCL Benchmark 3.0 beta

So you want OpenCL benchmarks? Phoronix is a benchmark for OSX and Linux, created by Michael Larabel and Matthew Tippett (http://en.wikipedia.org/wiki/Phoronix_Test_Suite). On Ubuntu, Phoronix version 2.8 is in the Ubuntu “app store” (Synaptic), but 3.0 has those nice OpenCL tests. The tests are based on David Bucciarelli’s OpenCL demos. Getting started with Phoronix 3.0 (beta 1) takes 4 easy steps:

  1. Download the latest beta-version from http://www.phoronix-test-suite.com/?k=downloads
  2. Extract. Can be anywhere. I chose /opt/phoronix-test-suite
  3. Install. Just type ./phoronix-test-suite in a terminal
  4. Use.

WARNING: It is beta software and the following might not work on your machine! If you have problems with this tutorial and want or have found a fix, post a reply.

Continue reading “Phoronix OpenCL Benchmark 3.0 beta”

Keep The Hardware Focus

The real Apu

If you buy a car, the first choice is rarely the kind of fuel. You first select on the engine properties, the looks, the interior, the brand and for sure the total cost of ownership. The costs can be a reason to choose a certain type of fuel, though. In the parallel computation world it is different. There the fuel (CUDA or OpenCL) is the first decision and then the hardware is chosen. I think this is wrong, which is why I speak a lot about CUDA-vs-OpenCL, even though I think NVIDIA is a good choice for a whole list of algorithms.

If we give advice during a consultation, we want to give the best advice. In the case of CUDA, that would be to go for Tesla or the latest GTX, depending on budget; in the case of OpenCL we can give much better advice on hardware. But actually, starting with the technique is the worst thing you can do: focus on the hardware and then pick the technique that suits it best.

IMPORTANT. The following is for understanding some concepts and limits only! It is purely theoretical, so I don’t claim any real-world results. Also not taken into account is how well different processors handle control instructions (for, while, if, case, etc.), which has quite some influence on actual performance.

Continue reading “Keep The Hardware Focus”

OpenCL vs CUDA Misconceptions


Translation available: Russian/Русский. (Let us know if you have translated this article too… And thank you!)


Last year I explained the main differences between CUDA and OpenCL. Now I want to get some old (and partly false) stories around CUDA-vs-OpenCL out of the world. While it has been claimed too often that one technique is just better, it should also be said that CUDA is better in some aspects, whereas OpenCL is better in others.

Why did I write this article? I think NVIDIA is visionary in both technology and marketing. But as I’ve written before, the potential market for dedicated graphics cards is shrinking, and I am therefore forecasting the end of CUDA on the desktop. Not having this discussion opens the door for closed standards and delays the innovation that can happen on top of OpenCL. The sooner people & companies start choosing a standard that gives equal competitive advantages, the more we can expect from the upcoming hardware.

Let’s stand by what we learnt at school about gathering information sources: don’t put all your eggs in one basket! Gather as many sources and references as possible. Please also read articles which claim (and underpin!) why CUDA has a more promising future than OpenCL. If you can, post comments with links to articles you think others should read too. We appreciate contributions!

I also found that Google Insights agrees with what I constructed manually.

Continue reading “OpenCL vs CUDA Misconceptions”

InsideHPC: SuperComputing. Where to from here?

In this video, Moderator Bob Feldman hosts a session entitled: Supercomputing: Where to from Here? Recorded at the National HPCC Conference 2011 in Newport.

Panelists:
Dr. Eng Lim Goh, SGI
Bill Feiereisen, Intel
Shmuel Shottan, BlueArc
Steve Lyness, Appro International, Inc.
Marc Hamilton, HP Americas

http://www.youtube.com/watch?v=wI957eRr1kM

Below is a summary of what was said. These are just my notes, so go to the times mentioned to listen to the exact answers. I did not write down every detail, so some things you might think are important may be missing (or I missed them, as English is not my mother tongue).

Continue reading “InsideHPC: SuperComputing. Where to from here?”

Basic Concepts: online kernel compiling

Typos are a programmer’s worst nightmare, as they are bad for concentration. The code in your head is not the same as the code on the screen, and fixing that difference doesn’t have much to do with the actual problem solving. Code highlighting in the IDE helps, but it is better to use the actual OpenCL compiler without running your whole software: an Online OpenCL Compiler. In short, it is just an OpenCL program with a variable kernel as input, which uses the compilers of Intel, AMD, NVIDIA or whatever you have installed to try to compile the source. I have found two solutions, which both have to be built from source – so a C compiler is needed.

  • CLCC. It needs the boost libraries, cmake and make to build. Works on Windows, OSX and Linux (possibly needs some fixes, see below).
  • OnlineCLC. Needs waf to build. Seems to be Linux-only.

Continue reading “Basic Concepts: online kernel compiling”

Differences from OpenCL 1.1 to 1.2

This article will be of interest if you don’t want to read the whole new specifications [PDF] for OpenCL 1.2.

As always, feedback will be much appreciated.

After many meetings with the many members of the OpenCL task force, a lot of ideas sprouted. Every 17 or 18 months a new version of OpenCL comes out to give form to all these ideas. You can see totally new ideas coming up, as well as ideas already released in another product by a member. You can also see ideas not appearing at all, as other members voted against them. The last category is very interesting, and hopefully we’ll soon see a lot of forum discussion about what is missing now and should be in the next version.

With the release of 1.2 it was also announced that (at least) two task forces will be set up. One of them will target integration in high-level programming languages, which tells me that phase 1 of creating the standard is complete and we can expect to go for OpenCL 2.0. I will discuss these phases in a follow-up, along with what you as a user, programmer or customer can expect… and how you can act on it.

Another big announcement was that Altera is starting to support OpenCL for an FPGA product. In another article I will let you know everything there is to know. For now, let’s concentrate on the actual differences in this version software-wise, and what you can do with it. I have added links to the 1.1 and 1.2 man pages, so you can look things up.

Continue reading “Differences from OpenCL 1.1 to 1.2”

ZiiLabs Tablet

Intel has bought ZiiLabs, but you can still order the ZMS-40.

ZiiLabs has an early access program for OpenCL on their StemCell processor, the 100-Core ZMS-40. It could do more than 20 GFLOPS/Watt, but no official numbers have been released.

It consists of:

  • ZMS-40 powered tablet
  • OpenCL compiler (no information on whether it is a cross or native compiler)
  • Code samples

Read more at http://www.ziilabs.com/products/software/opencl.php about their program. Also check the information on the ZMS-40 to see what the processor is capable of. Here are a few characteristics:

  • Quad 1.5 GHz ARM Cortex-A9 MP Cores
  • 96x fully-programmable StemCell Media Processing cores
  • 58 GFlops StemCell compute power

Qualcomm Snapdragon 600 & 800 (Adreno 320 & 330)

Two currently available Adreno GPUs are known to have or get OpenCL support: the 320 and the 330, found in the Snapdragon 600 and Snapdragon 800 respectively.

Qualcomm does not provide a developer’s board, but the Sony Xperia Z is known to have OpenCL. Other phones are expected to have drivers pre-installed too. That is interesting, as new phones with the Adreno 330 will ship soon, such as the LG Optimus G2 LS980, Sony Xperia Z Ultra and a version of the Samsung Galaxy S4.

Drivers are still in beta and are known to have bugs (as of April 2013). This discussion is the most interesting to follow if you want to keep up to date.

There are plenty of tools available, such as the Snapdragon SDK for Android and these Tools and Resources for the Adreno GPU. In the latter you’ll find OpenCL samples you can run too (it is a Windows installer, for some vague reason, so Mac and Linux users need to do some extracting). You can start building the code from this project.

http://www.youtube.com/watch?feature=player_embedded&v=CaS0kpozyMM

Boards

Focus is on the more recent Snapdragon 800.

Inforce IFC6410 – Snapdragon 600

The IFC6410 is a $149 single-board computer with an Adreno 320 and a Qualcomm Snapdragon S4 Pro – APQ8064.

Datasheet (PDF)

Order here.

Bsquare Mobile Development Boards for Snapdragon 800

Processor: Quad-core Krait 400 CPU at up to 2.3GHz per core (Snapdragon 8974), Adreno 330 GPU, Hexagon QDSP6 V5. A few highlights: Wi-Fi n/ac, Bluetooth 4, USB 3.0, NFC, 1280x720p screen (tablet: 1920x1080p), 2GB 800MHz memory, 12MP+2MP camera. It all runs on Android 4.2 (Jelly Bean), so no Linaro packages. More info on the Qualcomm MDB page and on this Qualcomm blog.

Phone form factor: $799 – Tablet: $1099 – Also check out Bsquare’s information page for these products, but be aware there are some links to the wrong PDFs.

Warning: you cannot call or use your provider’s internet with these devices! The word ‘phone’ only refers to the form factor.

DragonBoard Snapdragon APQ8060A for Snapdragon 800

Some highlights: Snapdragon 8074 quad-core processor, 2GB of LPDDR3 RAM, 16GB of eMMC, Wi-Fi, Bluetooth, GPS, HDMI out and qHD LCD with capacitive multi-touch, Adreno 330.

Can be ordered via http://mydragonboard.org/db8074/ for $499,-

Sony Xperia Z phones

The Xperia Z1 and Xperia Z Ultra have OpenCL support and drivers are ready-loaded. Go here for an introduction of OpenCL on these phones.

It needs the Android NDK to run the OpenCL programs.

Sony sees great advantages in using OpenCL on their mobile phones – from the website:

You can also see that the execution speed is much faster using OpenCL on the GPU when compared to the plain single threaded c-code running on the CPU (tested on Sony Xperia Z1). In addition to the speed benefit, you may also find that you decrease energy consumption by utilizing OpenCL on the GPU compared to using standard programming methods on the CPU.

Winning demo of Tokyo Demo Fest 2013 uses OpenCL

The Tokyo Demo Fest 2013 is one of the many demo parties around the globe. Such parties are where great programmers meet great artists and show off what came out of their collaborations.

This year’s winner used OpenCL to compute real-time procedurally generated geometries. For the rest, C++, OpenGL and GameMonkey Script were used.

http://vimeo.com/59398253

Tech features: curl noise, volumetric textures, Perlin noise, mesh deformations, HDR/bloom, film grain, fractals, Hermite splines, Tweens and quaternion iridescent particles.

The creator, Tokyo-based Eddie Lee, has done more projects – be sure to visit his homepage. I hope more demosceners start using the power of OpenCL to get more out of their demos.

Do you see where the kernel below is used? Hint: check the other videos of Eddie.

__kernel void curlnel( 
                      __read_only image3d_t volume,  
                      sampler_t volumeSampler,  
                      __global float4 *output,  
                      float speed 
                      ) 
{ 
    uint index = get_global_id(0); 
    uint numParticles = get_global_size(0); 
    float indexNormalized = (float)index/numParticles; 

    // read from 3D texture 
    float4 vertData = output[index]; 

    float3 samplePos = vertData.s012; 
    samplePos = samplePos+(float3)(0.5f); 

    float4 voxel = (read_imagef(volume, volumeSampler, 
                   (float4)(samplePos,1.0f))); 

    vertData.s012 += voxel.xyz*speed; 

    output[index] = vertData; 
}

According to GPUVerify (see previous post) the line starting with “float4 voxel” has an error.
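For readers who want to step through what each work-item of the kernel above does, here is a rough CPU reference in plain C. It is only a sketch under assumptions: the sampler settings are not shown in the listing, so I assume normalized coordinates, nearest filtering and clamp-to-edge addressing, and the names `Float4`, `texel_index` and `curl_step_reference` are made up for this illustration. The volume is assumed to be stored x-fastest with at least one texel per axis.

```c
#include <stddef.h>

/* Stand-in for OpenCL's float4; the kernel only moves x, y and z. */
typedef struct { float x, y, z, w; } Float4;

/* Assumed sampler behaviour (not shown in the original listing):
   normalized coordinates, nearest filtering, clamp-to-edge. */
static size_t texel_index(float coord, size_t size)
{
    float scaled = coord * (float)size;
    if (!(scaled > 0.0f))            /* below the volume, or NaN */
        return 0;
    if (scaled >= (float)size)       /* beyond the volume */
        return size - 1;
    return (size_t)scaled;
}

/* One loop iteration corresponds to one work-item of the kernel:
   read the particle, sample the volume at position + 0.5,
   and move the particle along the sampled vector. */
void curl_step_reference(const Float4 *volume,
                         size_t nx, size_t ny, size_t nz,
                         Float4 *particles, size_t count, float speed)
{
    for (size_t p = 0; p < count; ++p) {
        Float4 v = particles[p];

        size_t ix = texel_index(v.x + 0.5f, nx);
        size_t iy = texel_index(v.y + 0.5f, ny);
        size_t iz = texel_index(v.z + 0.5f, nz);
        Float4 voxel = volume[(iz * ny + iy) * nx + ix];

        v.x += voxel.x * speed;
        v.y += voxel.y * speed;
        v.z += voxel.z * speed;

        particles[p] = v;            /* w is passed through unchanged */
    }
}
```

Comparing the output of such a reference with the GPU result for a handful of particles is a cheap sanity check, as long as the assumed sampler settings match the real ones.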