StreamComputing exists 5 years!

In January 2010 I took the first steps of StreamComputing (edit: rebranded to StreamHPC in 2017) by registering the website and writing a hello-world article. About 4 months of preparations and paperwork later, the freelance company was registered. Five years later it has turned into a small company, still with a strong focus on OpenCL, but with more employees and more customers.

I would like to thank the following people:

  • My parents and grand-mother for (financially) supporting me, even though they did not always understand why I was taking all those risks.
  • My friends, for understanding that I needed to work during evenings and weekends.
  • My good friend Laura for supporting me during the hard times of 2011 and 2012.
  • My girlfriend Elena for always being there for me.
  • My colleagues and OpenCL-experts Anca, Teemu and Oscar, who have done the real work the past year.
  • My customers for believing in OpenCL and trusting StreamComputing.

Without them, the company would never even have existed. Thank you! Continue reading “StreamComputing exists 5 years!”

ARM forums to find useful information for OpenCL development

OpenCL on ARM is hot, but it is just getting started. Currently it takes some time to find the information you need about these processors.

For OpenCL discussions the best place is the Khronos OpenCL board. But where can you go when you want to ask questions specifically about GPUs found in ARM-based SoCs, like Mali, PowerVR, Adreno and Vivante?

ARM’s new community site for all

ARM just launched the Connected Community (ARM CC). It is the place to connect to when you need general information on ARM IP, such as the ARM Mali, Cortex-A9 and Cortex-A15.

arm-forums

And here is how ARM themselves explains this initiative on one slide:

ARMConnectedCommunityIntro

Be sure to connect to StreamHPC. We hope this will indeed be the central place for the whole ecosystem, including Imagination, Qualcomm and Vivante.

ARM Mali

Mali-developer

The Mali Developer Center has its forums on the ARM Connected Community.

Imagination PowerVR

The graphics-section of their developer forums seems to be the best place.

imgtec dev forums

(Not @ ARM CC)

Qualcomm Adreno

Qualcomm has dev-forums too and has a section called Mobile Gaming & Graphics Optimization (Adreno™).

qualcomm-forum-adreno

(Not @ ARM CC)

Vivante

Vivante does not have a forum, but Freescale does. The i.MX forums seem to be the best place to ask your questions.

freescale-forums

@ARM CC

Others

Where do you find and share interesting information on mobile GPUs? Share it with others via the comments – the chances that your questions get answered increase when more people visit the forums.

USB-stick sized ARM-computers

Now that smartphones get more powerful and the internet makes it possible to have all your functionality and documents with you anywhere, the computer needs to be reinvented. You see all the big IT companies searching for what that might look like, from Windows Metro to complete docking stations that replace the desktop with your phone. A turbulent market.

One of the new product categories is the USB-stick sized computer. Stick it into a TV or monitor, log in, and you have your personal working environment. You never need to carry a laptop to your hotel room or conference, as long as a screen is available – any screen.

There are several USB-computers entering the market, but I want to introduce you to two. Both see a future for strong processors in portable devices, and neither has a real product with such strong processors yet. But you can expect that in 2013 you will have a device on your key-ring that can do very fast parallel processing, for a smooth Photoshop experience.

Continue reading “USB-stick sized ARM-computers”

StreamComputing is 7 years!

As of 1 April we are 7 years old. Because of all the jokes on that day, this post is a bit later.

Let me take you through our journey from a one-person company to what we are now. With pride I can say that (with ups and downs) StreamComputing (now rebranded to StreamHPC) has become a brand that stands for (extremely) fast software, HPC, GPUs and OpenCL.

7 years of changes

Different services

After 7 years it’s also time for changes. Initially we solely worked on OpenCL related services, mostly GPUs. And this is what we’re currently doing:

  • HPC GPU computing: OpenCL, CUDA, ROCm.
  • Embedded GPU computing: OpenCL, CUDA, RenderScript, Metal.
  • Networked FPGA programming: OpenCL.
  • GPU-drivers testing and optimisation.
  • Software architecture optimisations.

While you see OpenCL a lot, our expertise in vendor-specific CUDA (NVidia), ROCm (AMD), RenderScript (Google) and Metal (Apple) cannot be ignored. Hence the “Performance Engineers” and not “GPU consultants” or “OpenCL programmers”.

From Fixers to Builders and getting new competition

Another change is that we have moved from fixing code afterwards to building software from scratch.

This has been a slow process, and it had to do with growing confidence in performance engineering as an expert profession instead of a trick. We're seeing new companies coming into the market, providing GPU-computing next to their usual services. This is a sign of the market growing up.

We’re confident in growing further in our market, as we have the expertise to design fast software while the newcomers have gained expertise to write code that runs on the GPU with only little speedup.

Community: OpenCL:PRO to OpenCL.org

There have been several moments when we wanted to support the community more. The first try was OpenCL:PRO, which did not live long, as it was actually unclear to us what “the community” wanted.

In the end it was not that hard. Everybody who starts with OpenCL has the same problems:

  • Lack of convenience code, resulting in many, many wrappers and libraries that are incompatible.
  • Lack of practice projects.
  • Lack of overview on what’s available.

With OpenCL.org we aim to solve these problems together with the community. Everything is shared on GitHub, and anybody can join in to complete the information. While our homepage had around 40 pages on these subjects, they only gave our personal view or contained outdated info.

So we’re going to donate most of the OpenCL-related technical pages we’ve written over the years to the community.

There is much more to share – watch our blog, the OpenCL.org Twitter account and the newsletter!

Different Logo

For those who remember: in 2010 the logo looked quite different. We still use the blocks in the background (like on our Twitter account), but since 2014 the colours and font are quite different. This change went along with the company growing up. The old logo is careful, while the new one is bold – we're now more confident about our expertise and value.

Over the past 3 years the new logo has stayed the same and has fully become our identity.

Same kind of customers

It has been quite a journey! We could not have done it without all the customers we served over those 7 years.

Thank you!

What is OpenCL?

OpenCL (a trademark of Apple Inc.) is an open, royalty-free industry standard that makes much faster computations possible. The standard is controlled by the non-profit standards organisation Khronos. By using this technique on graphics cards (GPUs) or on the vector extensions of modern processors, you can for example convert a video in 20 minutes instead of 2 hours.

Programming the GPU was a very difficult task done by specialised teams and universities, but since 2010 it is within reach of more companies.

Below is a video which explains the differences between single-core, multi-core (starting at 1:27) and OpenCL (starting at 2:32).

http://www.youtube.com/watch?v=IEWGTpsFtt8

You can read more about the engineering ins and outs of the standard at http://www.khronos.org/opencl/.

How OpenCL works

OpenCL is an extension to existing languages. It makes it possible to specify a piece of code that is executed multiple times, independently of each other. This code can run on various processors – not only the main one. Also, there is support for vector types (float2, short4, int8, long16, etc.), because modern processors support those.

Say, for example, you need to calculate sin(x) of a large array of one million numbers. OpenCL detects which devices can compute this for you and gives some statistics about each device. You can pick the best device, or even several devices, and send the data to the device(s). Normally you would loop over the million numbers, but now you say something like: “Get me sin(x) of each x in array A”. When finished, you take the data back from the device(s) and you are done.
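To make this concrete, below is a sketch of what such a piece of code – a “kernel” – could look like in OpenCL C. The kernel and array names are just for illustration:

__kernel void compute_sin(__global const float* A, __global float* result)
{
    // Each work-item computes exactly one element, instead of the
    // host looping over all one million numbers sequentially.
    size_t i = get_global_id(0);
    result[i] = sin(A[i]);
}

The host then enqueues this kernel with a global size of one million, and the OpenCL runtime spreads the work-items over the compute units of the chosen device(s).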

As the compute-devices can do more in parallel and OpenCL is better at describing independent functions, the total execution time is much lower than with conventional methods.

5 questions on OpenCL

Q: Why is it so fast?
A: Because many hands make light work – the hundreds of little processors on a graphics card being the extra hands. Cooperation with the main processor remains important, though, to achieve maximum output.

Q: Does it work on any type of hardware?
A: As it is an open standard, it can work on any type of hardware that targets parallel execution. This can be a CPU, GPU, DSP or FPGA.

Q: How does it compare to OpenMP/MPI?
A: Where OpenMP and MPI try to split loops over threads/servers and are CPU-oriented, OpenCL focuses on making threads data-position aware and on using the processor's capabilities. There are several efforts to combine the two worlds.

Q: Does it replace C or C++?
A: No, it is an extension which integrates well with C, C++, Python, Java and more.

Q: How stable/mature is OpenCL?
A: Currently we have reached version 1.2, and the standard itself is 3 years old. OpenCL has many predecessors, though, and is therefore effectively quite a bit older than 3 years.

The most noticeable processors from NVIDIA, AMD and Intel

10 years ago we had CPUs from Intel and AMD, and GPUs from ATI and NVidia. There was even another CPU-maker, VIA, and GPU-makers S3 and Matrox. Things are different now. Below I want to briefly discuss the most noticeable processors from each of the big three.

The reason for this blog-post is that many processors are relatively unknown, and several problems are therefore solved inefficiently. 

NVidia

As NVidia doesn’t have X86, they mostly focuses on GPUs and bet on POWER and ARM for CPU. They already sell their Pascal-architecture in small numbers.

2017 will all be about their Pascal-architecture.

Tesla K80 (Kepler)

  • The GPU is not simply 2 × K40 (GK110B GPUs); the chip is actually different (GK210).
  • It is the NVidia GPU with the largest private memory size usable in kernels: 255 registers per work-item.

This is the GPU for lazy programmers and for genuinely complex code: kernels can use double the number of registers.

Pascal P100 (Pascal)

  • 20 TFLOPS Half Precision (HP), 10 TFLOPS single precision, 5 TFLOPS double precision
  • 16 GB HBM2 (720 GB/s).
  • NVLink up to 64 GB/s effectively (20% of the 80 GB/s is protocol overhead), dual simplex bidirectional (so dedicated wires per direction). Each NVLink offers 16 GB/s up and 16 GB/s down. Compared to 12 GB/s PCIe 3 x16 (24 GB/s cumulative), this is a good speed-up. The support is only available between Pascal GPUs, and not between the GPU and CPU yet.
  • OpenPOWER support coming, to compete with Intel.

Now only available in a $129,000 server with 8 of these (making the price of each P100 about $15,000). It will probably be widely available somewhere in Q1 2017, when HBM2 production is up to speed. It is unknown what the price will be then – that depends on how many companies are willing to pay the high price now.

The GPU is perfect for deep learning, which NVidia is highly focused on. The 5 TFLOPS double precision is very interesting too. A server with 8 GPUs gives you 80 TFLOPS single precision – double that, if you only need half precision.

Titan Black (Kepler) and GTX 980 (Maxwell)

  • The Titan Black has 1.7 TFLOPS DP, 4.5 TFLOPS SP.
  • The GTX 980 has 0.14 TFLOPS DP, 4.6 TFLOPS SP.

The two best-selling GPUs from NVidia that are not server-grade. What is interesting to note is that the GTX 980 is not always faster than the Titan Black, even though it's more recent.

Tegra X1

  • 0.5 TFLOPS SP (GPU), 1 TFLOPS HP
  • 10 Watts

While not well-accepted in the automotive industry (it uses too much power and has no OpenCL), it is well-accepted in the car-entertainment industry.

AMD

Known for the strongest OpenCL support since 2012. With the HSA-capable Fiji GPUs, they have now arrived at their third GPGPU architecture, after “VLIW” and “GCN” – fully driven by their HSA initiative.

For 2017 they focus on their main advantages: brute Single Precision performance, HBM (they have early access), their new CPU (Zen) and new GPU (Polaris).

FirePro S9170 (GCN)

  • 32GB GDDR5 global memory
  • 2.5 TFLOPS DP, 5 TFLOPS SP

The GPU's processor is the same as the FirePro S9150's, which has been the unknown best DP performer of the past years. That GPU got the number 1 spot among air-cooled solutions, only to be surpassed by oil-submersed solutions. The S9170 builds on top of this and adds an extra 16 GB of memory.

The S9170 is the GPU with the largest amount of memory, good for problems that use a lot of memory and are bandwidth-limited – think oil & gas and weather calculations, which currently don't fit on GPUs.

Radeon R9 Nano and FirePro S9300X2 (Fiji)

  • Nano: 0.8 TFLOPS DP, 8 TFLOPS SP, no HP-support at the processor (only for data-transfers)
  • S9300X2: 1.4 TFLOPS DP, 13.9 TFLOPS SP (lower clocked)
  • Nano 175 Watt, S9300X2 300 Watt
  • Nano has 4 GB HBM, with a bandwidth up to 512GB/s, S9300X2 has 2x 4GB HBM.

The Nano is the answer to NVidia’s Titans, and the S9300X2 is its server-class version.

These GPUs bring the best SP-GFLOPS/€ and the best SP-GFLOPS/Watt as of now. The Nano focuses on VR desktops, whereas the S9300X2 enables you to put up to 111 TFLOPS in one server.

AMD Carrizo A10 8890k APU (HSA)

  • CPU with built-in GPU
  • About one TFLOPS
  • TDP of 95 Watt

The fastest HSA-capable processor out there. This means that complex software that needs a mix of task-parallelism and data-parallelism runs best on such a processor. Among CPUs with a built-in GPU, it has the most TFLOPS available on the market.

Intel

After years of “boy who cried wolf” stories, they finally seem to have gotten the Larrabee they promised years ago. With the acquisition of Altera, new processors are on the horizon.

Their focus is still on customers who use test-driven design and want to “make it run quickly, make it perform later”.

Xeon E5-2699 v4

  • 55MB cache, 22 cores
  • AVX 2.0 (256 bit vector operations)
  • DDR4 (60 GB/s)

Not well-known, but this CPU is very capable of running complex HPC code for the price of a high-end GPU. It can reach about 0.64 TFLOPS DP peak when fully using all cores and AVX 2.0 – roughly 22 cores × 1.8 GHz (AVX base clock) × 16 DP FLOPS per cycle.

XeonPhi Knights Landing

  • Available in socket and PCI version
  • 3 TFLOPS DP, 6 TFLOPS SP
  • AVX 512 (512 bit vector operations)
  • 16 GB HBM (over 400 GB/s), up to 384 GB DDR4 (60 GB/s).
  • Currently (?) not programmable with OpenCL

After years of ok-ish XeonPhis, it seems Intel now has a processor that competes with AMD and NVidia. Existing code (almost) just works on this processor and can then be improved step by step. The only thing not to like is the lack of benchmarks – so the numbers above are all on paper.

Xeon+FPGA

  • Task-parallel processor
  • Low-latency

The reconfigurable chip that has been promised for over 2 decades.

I'm still researching this upcoming processor, as one of the strengths of an FPGA is the low-latency links to DisplayPort and networking, which seem to go via PCIe on this processor.

Iris GPUs

  • CPU with built-in GPU
  • 0.7 TFLOPS SP

As these GPUs are included in almost all CPUs that Intel sells, they are the most-sold GPUs.

Selecting the right hardware

Choosing the best hardware has become quite complex, especially when focusing on TCO (Total Cost of Ownership). At StreamHPC we have experience with many of the devices above, but also with various embedded hardware that competes with the processors above on a totally different scale. You need to select the right benchmarks to know what your device of choice is – we can help with that.

GPUDirect and DirectGMA – direct GPU-GPU communication via RDMA

Wrong!
Contrary to what you see around (on slides like these), AMD and Intel also have support for RDMA.

A while ago I found the slide on the right, claiming that AMD did not have any direct GPU-GPU communication. Several sources showed there is, but it seems not to be a well-known feature. The feature is known as SDI (mostly on network cards, SSDs and FPGAs), but not much information is to be found on PCI+SDI. More often the term RDMA is used: Remote Direct Memory Access (Wikipedia).

Questions I try to answer:

  • Which server-grade GPUs support direct GPU-GPU communication when using OpenCL?
  • What other characteristics are interesting for OpenCL devs, besides direct GPU-GPU, GPU-FPGA and GPU-NIC communication?
  • How do you code such fast communication?

Enjoy reading! Continue reading “GPUDirect and DirectGMA – direct GPU-GPU communication via RDMA”

GPU and FPGA challenge for MSc and PhD students

While going through my email, I found out about the third “HiPEAC Student Heterogeneous Programming Challenge”. Unfortunately the deadline was last week, but I just got an email: if you register by this weekend (17 September), you can still join.

EDIT: if you joined, be sure to comment in early November on how it went. That would hopefully motivate others to join next year. Continue reading “GPU and FPGA challenge for MSc and PhD students”

Altera published their OpenCL-on-FPGA optimization guide

Altera-doc

Altera has just released their optimisation guide for OpenCL-on-FPGAs. It does not go into the how-tos of OpenCL, but assumes you have knowledge of the technology. Neither does it provide any information on the basics of Altera's Stratix V or other FPGAs.

It is the first public optimisation document, so feedback sent directly to Altera is appreciated. Not aware of what OpenCL can do on an FPGA? Watch the video below.

https://www.youtube.com/watch?v=p25CVFMc-dk

Subjects

The following subjects and optimisation tricks are discussed:

  • FPGA Overview
  • Pipelines
  • Good Design Practices
  • Avoid Pointer Aliasing
  • Avoid Expensive Functions
  • Avoid Work-Item ID-Dependent Backward Branching
  • Aligned Memory Allocation
  • Ensure 4-Byte Alignment for All Data Structures
  • Maintain Similar Structures for Vector Type Elements
  • Optimization of Data Processing Efficiency
  • Specify a Maximum Work-Group Size or a Required Work-Group Size
  • Loop Unrolling
  • Resource Sharing
  • Kernel Vectorization
  • Multiple Compute Units
  • Combination of Compute Unit Replication and Kernel SIMD Vectorization
  • Resource-Driven Optimization
  • Floating-Point Operations
  • Optimization of Memory Access Efficiency
  • General Guidelines on Optimizing Memory Accesses
  • Optimize Global Memory Accesses
  • Perform Kernel Computations Using Constant, Local or Private Memory
  • Single Work-Item Execution

Carefully compare these with CPU and GPU optimisation guides to be able to write more generic OpenCL code.
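To give an idea of how a few of these subjects translate to kernel code, here is a rough sketch – not taken from the guide. The restrict keyword and the reqd_work_group_size attribute are standard OpenCL; num_simd_work_items, num_compute_units and #pragma unroll are Altera-specific, and the exact syntax may differ per SDK version. The kernel itself is made up for illustration:

__attribute__((reqd_work_group_size(64, 1, 1)))  // specify a required work-group size
__attribute__((num_simd_work_items(4)))          // kernel SIMD vectorization
__attribute__((num_compute_units(2)))            // multiple compute units
__kernel void sum8(__global const float* restrict in,  // restrict avoids pointer aliasing
                   __global float* restrict out)
{
    size_t i = get_global_id(0);
    float acc = 0.0f;
    #pragma unroll                               // loop unrolling
    for (int j = 0; j < 8; j++)
        acc += in[i * 8 + j];
    out[i] = acc;
}

On a GPU you would rather let work-items cooperate via local memory; on an FPGA the compiler turns the unrolled loop and the attributes into deeper, replicated pipelines – which is exactly why the optimisation guides differ.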

Download

You can download the document here.

If you have any question on OpenCL-on-FPGAs, OpenCL, generic optimisations or Altera FPGAs, feel welcomed to contact us.

Visit us (Amsterdam)

So we invited you over? Cool! See you soon!

The Amsterdam Stream HPC offices are located on the sixth floor of Koningin Wilhelminaplein 1 in Amsterdam, which is at the Amsterdam West Poort business area. Below you’ll find information on how to get there.

The office building

When you arrive, ask at the desk to have somebody pick you up. If you want to test out the office security: the unit is 6.01.

Getting to Koningin Wilhelminaplein 1

By Car

The office is located near the ring road A10, which makes the location easily accessible by car, via exit S107.

From the ring road A10 the complete Dutch motorway network is accessible. Taking the A10 to the South often results in a traffic jam though. See https://www.anwb.nl/verkeer for up-to-date traffic info.

Parking in the parking garage is only available if you let us know in advance! There is a ParkBee at 5 minutes walking distance – always more than enough space. It costs at most €10 per day when using the Yellowbrick app or when reserved via ParkBee, and about €20 per day when paid at the location. Please get clarity in advance on who pays this.

Travel times per route (outside rush hours):

  • Office – Schiphol: 15 minutes
  • Office – The Hague: 40 minutes
  • Office – Utrecht: 35 minutes
  • Office – Rotterdam: 50 minutes

By Public transport

The office is a 5 minute walk from Amsterdam Lelylaan. See further below for the walking route.

View in the direction of the office from the metro station

Lelylaan station is a medium-sized public transport hub in Amsterdam. It should be easy to get here from any big city or any address in Amsterdam, as many fast trains also stop here.

  • Trains to the North: Amsterdam Central, Haarlem, North and East of the Netherlands
  • Trains to the South: Schiphol, Amsterdam Zuid, Amsterdam RAI, Utrecht, Eindhoven, Leiden and Rotterdam
  • Bus: Lines 62 (Amstel), 63 (Osdorp), 195 (Schiphol).
  • Metro: Line 50, connecting to the Amsterdam train stations Sloterdijk, Lelylaan, Zuid, RAI and Bullewijk. In case there are problems with the train to Lelylaan/Sloterdijk, one option is to go to Amsterdam Zuid and take the metro from there. Line 51 connects to the Vrije Universiteit in Amsterdam Zuid.
  • Tram: Lines 1 (Osdorp – Muiderpoort) and 17 (Osdorp – Central station).

See https://9292.nl/station-amsterdam-lelylaan for all time tables and planning trips.

Walking from the train/metro station

Remember that in the Netherlands crossing car lanes is relatively safer than crossing bike lanes, contrary to traffic in other countries. In Dutch cities, cars brake when you cross the street, while bikes simply don't. No joke. So be sure not to walk on the red bike lanes unless really necessary.

When leaving the train station, make sure you take the Schipluidenlaan exit towards the south (to the right, when you see the view as in the image). This is where the buses are, not the trams. If you are at the tram area (between two car roads), go back into the station.

From the bus stop, go to the roundabout to the west. Walk the whole street to the next roundabout, where you will see the shiny office building on your right.

By Taxi

In Amsterdam you can order a taxi via +31-20-6777777 (+31-20-6, then six times 7). Expect a minimum charge of €20.

At Schiphol Airport there are official taxi stands – it'll take 15-25 minutes to get to Lelylaan outside rush hours. Make sure to mention the roundabout reconstruction to prevent a 10-minute longer drive.

Bicycle

For biking use https://www.route.nl/routeplanner and use “Rembrandtpark” as the end-point for the better/nicer/faster routes. From the park it’s very quick to get to the office – use a normal maps app to get to the final destination.

Software Development

Have you developed software that gives the answers you need but takes too long? Or maybe you need to process large data-sets on an hourly basis, while each batch takes 2 hours?

What do you do when faster hardware starts to get too costly in terms of maintenance? You can buy specialized hardware, but that increases costs and dependence on external knowledge. Or you can choose to just wait for the results to come in, but you can only do this when the computation is not a core process.

What if you could use off-the-shelf hardware to decrease waiting time? By using OpenCL devices, which can be high-end graphics cards or other modern processors, software can be sped up by a factor of 2 to 20. Why? Because these devices can do much more in parallel, and OpenCL makes it possible to use that (unused) potential. A few years ago this was not possible in the way it is done now; that's probably the main reason you haven't heard of it.

Solutions

Everything we offer comes down to three solutions: find what is available, make a parallel version of the code, and hand-tune the code for maximum performance.

[pricing_tables]
[pricing_table column=”one_third” title=”Specialised Libraries” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • For many cases “good enough”
  • Faster code, the easy way
  • Gives high performance for generic problems

[/pricing_table]
[pricing_table column=”one_third” title=”Parallel Coding” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • Better caching can give more boost than using faster hardware
  • Software running in parallel is a first step to GPU-computing
  • Making the software modular when possible

[/pricing_table]
[pricing_table column=”one_third” title=”High Performance Coding” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • The highest performance is guaranteed
  • Optimized for the targeted hardware

[/pricing_table]
[/pricing_tables]

Services

There are many possibilities to speed up code, but only one is the best fit. To help you find the right path, we offer various services.

[pricing_tables]
[pricing_table column=”one_third” title=”Code Review” buttontext=”More info »” buttonurl=”https://streamhpc.com/consultancy/our-services/code-review/” buttoncolor=””]

  • Code-review of GPU-code (OpenCL, CUDA, Aparapi, and more).
  • Code-review of CPU-code (Java, C, C++ and more).
  • Report within 1 week if necessary.

[/pricing_table]
[pricing_table column=”one_third” title=”GPU Assessment” buttontext=”More info »” buttonurl=”https://streamhpc.com/consultancy/rapid-opencl-assessment/” buttoncolor=””]

  • Find parallelizable computations
  • Assess the fitness of the code to run on GPUs
  • Report within 2 weeks

[/pricing_table]
[pricing_table column=”one_third” title=”Architecture Assessment” buttontext=”Request more info »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • Architecture check-up
  • Data-transport measurements
  • Report within 2 weeks

[/pricing_table]
[/pricing_tables]

More information

We can make your compute-intensive algorithms much faster and scalable. How do we do it? We can explain it all to you by phone or in person. Send in the form on this page, and we will contact you.

You can also call now: +31 6454 00 456.

We invite you to download our brochures to get an overview of how we can help you widen the bottlenecks in your software.

Help write the book “Numerical Computations with GPUs”

There is an interesting book coming up: “Numerical Computations with GPUs” – a book explaining various numerical algorithms, with code in CUDA or OpenCL.

edit: At the moment there are 21 articles to be included in the book.

edit 2: book should be out in July

edit 3: Order via Springer International or Amazon US.
TOC:

  • Accelerating Numerical Dense Linear Algebra Calculations with GPUs.
  • A Guide to Implement Tridiagonal Solvers on GPUs.
  • Batch Matrix Exponentiation.
  • Efficient Batch LU and QR Decomposition on GPU.
  • A Flexible CUDA LU-Based Solver for Small, Batched Linear Systems.
  • Sparse Matrix-Vector Product.
  • Solving Ordinary Differential Equations on GPUs.
  • GPU-based integration of large numbers of independent ODE systems.
  • Finite and spectral element methods on unstructured grids for flow and wave propagation problems.
  • A GPU implementation for solving the Convection Diffusion equation using the Local Modified SOR method.
  • Pseudorandom numbers generation for Monte Carlo simulations on GPUs: Open CL approach.
  • Monte Carlo Automatic Integration with Dynamic Parallelism in CUDA.
  • GPU-Accelerated computation routines for quantum trajectories method.
  • Monte Carlo Simulation of Dynamic Systems on GPUs.
  • Fast Fourier Transform (FFT) on GPUs.
  • A Highly Efficient FFT Using Shared-Memory Multiplexing.
  • Increasing parallelism and reducing thread contentions in mapping localized N-body simulations to GPUs.

 

Continue reading “Help write the book “Numerical Computations with GPUs””

AMD ROCm 1.5 Linux driver-stack is out

ROCm is AMD's open-source Linux driver-stack that brings compute to HSA hardware. It does not provide graphics and is therefore focused on monitor-less applications like machine learning, math, media processing, machine vision, large-scale simulations and more.

For those who do not know HSA: the Heterogeneous System Architecture defines hardware and software such that different processor types (like CPU, GPU, DSP and FPGA) can seamlessly work together and have fine-grained memory sharing. Read more on HSA here.

About ROCm and its short history

The driver stack has been on GitHub for more than a year now. Development is done internally, while communication with users happens mostly via GitHub's issue tracker. ROCm 1.0 was publicly announced on 25 April 2016. Since version 1.0 there have been 6 releases in only one year – the 4 months of waiting between 1.4 and 1.5 were therefore relatively long. You can certainly say development happens at a high pace.

ROCm 1.4 was released at the end of December and, besides a long list of bug fixes, it added a developer preview of OpenCL 2.0 kernel support. Support for OpenCL was limited to Fiji (R9 Fury series) and Baffin/Ellesmere (Radeon RX 400 series) GPUs, as these have the best HSA support of the current GPU offerings.

Currently not all parts of the driver stack are open source, but the binary blobs will be open-sourced eventually. You might wonder why a big corporation like AMD would open-source such an important part of their offering. It makes total sense once you understand that their most important customers spend a lot of time on making the drivers and their own code work together. Giving access to the code makes debugging a lot easier and reduces development time. This results in fewer bugs and a shorter time-to-market for the AMD version of the software.

The OpenCL language runtime and compiler will be open-sourced soon, after which AMD offers full OpenCL without any binary blob.

What does ROCm 1.5 bring?

Version 1.5 adds improved support for OpenCL, where 1.4 only gave a developer preview. Both feature support and performance have been improved. Just like in 1.4 there is support for OpenCL 2.0 kernels and OpenCL 1.2 host code – the tool clinfo mentions there is even some support for 2.1 kernels, but we haven't fully tested this yet.
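A practical note on that split: OpenCL compilers default to the 1.x kernel language, so to actually use 2.0 kernel features you normally request them at build time. A minimal, hedged one-liner (assuming program and device were created as usual):

// "-cl-std=CL2.0" is the standard build option from the OpenCL spec
// to enable the OpenCL C 2.0 kernel language.
err = clBuildProgram(program, 1, &device, "-cl-std=CL2.0", NULL, NULL);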

The command-line administration tool (ROCm-SMI) adds power monitoring, so power efficiency can be measured.
The HCC compiler was upgraded to the latest Clang/LLVM. There have also been big improvements in C++ compatibility.

Other improvements:

  1. Added a new API, hipHccModuleLaunchKernel, which works exactly like hipModuleLaunchKernel but takes the OpenCL programming model's launch parameters – together with a test for it.
  2. Added a new API, hipMemPtrGetInfo.
  3. Added a new field to hipDeviceProp_t, gcnArch, which returns 803, 700, 900, etc.

Bug fixes:

  1. Fixed Copyright and header names
  2. Fixed issue with bit_extract sample
  3. Enabled lgamma and lgammaf
  4. Added guard for GFX8 specific intrinsics
  5. Fixed few issues with operator overloading of vector data types
  6. Fixed atanf
  7. Added a guard for __half data types to work with Clang versions newer than 3 (will be removed eventually).
  8. Fixed 4_shfl to work only on gfx803, as Hawaii doesn't support permute ops.

Current hardware support:

  • GFX7: Radeon R9 290 4 GB, Radeon R9 290X 8 GB, Radeon R9 390 8 GB, Radeon R9 390X 8 GB, FirePro W9100 (16GB), FirePro S9150 (16 GB), and FirePro S9170 (32 GB).
  • GFX8: Radeon RX 480, Radeon RX 470, Radeon RX 460, Radeon R9 Nano, Radeon R9 Fury, Radeon R9 Fury X, Radeon Pro WX7100, Radeon Pro WX5100, Radeon Pro WX4100, and FirePro S9300 x2.

If you’re buying new hardware, pick a GPU from the GFX8 list. FirePro S9300 X2 is currently the server-grade solution of choice.

Keep an eye on the Phoronix website, which is usually the first to benchmark AMD's open-source drivers.

Install ROCm 1.5

Where 1.4 had support for Ubuntu 14.04, Ubuntu 16.04 and Fedora 23, 1.5 added support for Fedora 24 and dropped support for Ubuntu 14.04 and Fedora 23. On other distributions than Ubuntu 16.04 or Fedora 24 it *could* work, but there are zero guarantees.

Follow the instructions on Github step-by-step to get it installed via deb or rpm. Be sure to uninstall any previous release of ROCm to avoid problems.

The part on Grub might not be clear. For this release the magic GRUB_DEFAULT line on Ubuntu 16.04 is:

GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 4.9.0-kfd-compute-rocm-rel-1.5-76"

You need to alter this line with every update, else it will keep booting the old kernel. After changing /etc/default/grub, run “sudo update-grub” to apply it.

Make sure “/opt/rocm/bin/” is in your PATH when you want to do some coding. When running the sample, you should get:

/opt/rocm/hsa/sample$ sudo make
gcc -c -I/opt/rocm/include -o vector_copy.o vector_copy.c -std=c99
gcc -Wl,--unresolved-symbols=ignore-in-shared-libs vector_copy.o -L/opt/rocm/lib -lhsa-runtime64 -o vector_copy
/opt/rocm/hsa/sample$ ./vector_copy
Initializing the hsa runtime succeeded.
Checking finalizer 1.0 extension support succeeded.
Generating function table for finalizer succeeded.
Getting a gpu agent succeeded.
Querying the agent name succeeded.
The agent name is gfx803.
Querying the agent maximum queue size succeeded.
The maximum queue size is 131072.
Creating the queue succeeded.
"Obtaining machine model" succeeded.
"Getting agent profile" succeeded.
Create the program succeeded.
Adding the brig module to the program succeeded.
Query the agents isa succeeded.
Finalizing the program succeeded.
Destroying the program succeeded.
Create the executable succeeded.
Loading the code object succeeded.
Freeze the executable succeeded.
Extract the symbol from the executable succeeded.
Extracting the symbol from the executable succeeded.
Extracting the kernarg segment size from the executable succeeded.
Extracting the group segment size from the executable succeeded.
Extracting the private segment from the executable succeeded.
Creating a HSA signal succeeded.
Finding a fine grained memory region succeeded.
Allocating argument memory for input parameter succeeded.
Allocating argument memory for output parameter succeeded.
Finding a kernarg memory region succeeded.
Allocating kernel argument memory buffer succeeded.
Dispatching the kernel succeeded.
Passed validation.
Freeing kernel argument memory buffer succeeded.
Destroying the signal succeeded.
Destroying the executable succeeded.
Destroying the code object succeeded.
Destroying the queue succeeded.
Freeing in argument memory buffer succeeded.
Freeing out argument memory buffer succeeded.
Shutting down the runtime succeeded.

Also clinfo (installed from the default repo) should work.

Got it installed and tried your code? Did you see improvements? Share your experiences in the comments!

Not really ROCk music, but this blog post was written while listening to the latest album of the Gorillaz.

Handling OpenCL with CMake 3.1 and higher

There has been quite some “find OpenCL” code for CMake floating around. If you haven't heard of CMake: it's the most useful tool to build cross-platform software.

Put this into your CMakeLists.txt, changing the name of the executable.

#Minimal OpenCL CMakeLists.txt by StreamHPC

cmake_minimum_required (VERSION 3.1)

project(GreatProject)

# Handle OpenCL (FindOpenCL ships with CMake 3.1 and newer)
find_package(OpenCL REQUIRED)
include_directories(${OpenCL_INCLUDE_DIRS})

add_executable (main main.cpp)
target_include_directories (main PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})
target_link_libraries (main ${OpenCL_LIBRARIES})

Then do the usual:

  • make a build-directory
  • cd build
  • cmake .. (specifying the right Generator)

Add your own CMake snippets and you're one happy dev!

CMake 3.7

CMake 3.7 makes it even easier! You can do the following:

find_package(OpenCL REQUIRED)
add_executable(test_tgt main.c)
target_link_libraries(test_tgt OpenCL::OpenCL)

This automatically sets up the include paths and target library to link against. No need to use the ${OpenCL_INCLUDE_DIRS} and ${OpenCL_LIBRARIES} any more.

(Thanks Matthäus G. Chajdas for improving this!)

Getting CMake 3.1 or higher

  • Ubuntu/Debian: Get the PPA.
  • Other Linux: Get the latest tar.gz and compile.
  • Windows/OSX: Download the latest exe/dmg from the CMake homepage.

If you have more tips to share, put them in the comments.

An introduction to Grid-processors: Parallella, Kalray and KnuPath

We have been talking about GPUs, FPGAs and CPUs a lot, but there are more processors that can solve specific problems. This time I'd like to give you a quick introduction to grid-processors.

Grid-processors are different from GPUs. Where a multi-core GPU gets its strength from being able to compute lots of data in parallel (SIMD data-parallelism), a grid-processor is able to have each core do something different (MIMD, task-based parallelism). You could say that a grid-processor is a multi-core CPU where the number of cores is at least 16, and the cores are only connected to their neighbours. The difference with full-blown CPUs is that the cores are smaller (like on a GPU) and thus use less power. The companies themselves categorise their processors as DSPs, or Digital Signal Processors, but most popular DSPs only have 1 to 8 cores.

For the context, there are several types of bus-configurations:

  • single bus: like the PCIe-bus in a PC or the i.MX6.
  • ring bus: like the XeonPhi up to Knights Corner, and the Cell processor.
  • star bus: a central communication core with the compute-cores around.
  • full mesh bus: each core is connected to each core.
  • grid bus: all cores are connected to their direct neighbours. Messages hop from core to core.

Each of them has its advantages and disadvantages. Grid-processors get great performance (per Watt) with:

  • video encoding
  • signal processing
  • cryptography
  • neural networks

Continue reading “An introduction to Grid-processors: Parallella, Kalray and KnuPath”

Overview of OpenCL 2.0 hardware support, samples, blogs and drivers

We were too busy lately to tell you about it: OpenCL 2.0 is getting ready for prime time! As it makes use of more recent hardware features, it's more powerful than OpenCL 1.x could ever be.

To get you up to speed, see this list of new OpenCL 2.0 features:

  • Shared Virtual Memory: host and device kernels can directly share complex, pointer-containing data structures such as trees and linked lists, providing significant programming flexibility and eliminating costly data transfers between host and devices (see the sketch after this list).
  • Dynamic Parallelism: device kernels can enqueue kernels to the same device with no host interaction, enabling flexible work scheduling paradigms and avoiding the need to transfer execution control and data between the device and host, often significantly offloading host processor bottlenecks.
  • Generic Address Space: functions can be written without specifying a named address space for arguments, especially useful for those arguments that are declared to be a pointer to a type, eliminating the need for multiple functions to be written for each named address space used in an application.
  • Improved image support: including sRGB images and 3D image writes, the ability for kernels to read from and write to the same image, and the creation of OpenCL images from a mip-mapped or a multi-sampled OpenGL texture for improved OpenGL interop.
  • C11 Atomics: a subset of C11 atomics and synchronization operations to enable assignments in one work-item to be visible to other work-items in a work-group, across work-groups executing on a device or for sharing data between the OpenCL device and host.
  • Pipes: memory objects that store data organized as a FIFO and OpenCL 2.0 provides built-in functions for kernels to read from or write to a pipe, providing straightforward programming of pipe data structures that can be highly optimized by OpenCL implementers.
  • Android Installable Client Driver Extension: Enables OpenCL implementations to be discovered and loaded as a shared object on Android systems.
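To give a taste of the first bullet, below is a minimal, hedged sketch of coarse-grained SVM host code. Error handling is omitted, and it assumes context, queue and kernel were already created on an OpenCL 2.0 platform:

size_t N = 1024;

// Allocate memory that both host and device can address directly.
float* data = (float*)clSVMAlloc(context, CL_MEM_READ_WRITE, N * sizeof(float), 0);

// With coarse-grained SVM the host maps the buffer before touching it...
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, N * sizeof(float), 0, NULL, NULL);
for (size_t i = 0; i < N; i++)
    data[i] = (float)i;
clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

// ...and passes the pointer itself to the kernel: no clCreateBuffer,
// no clEnqueueWriteBuffer, no explicit copies.
clSetKernelArgSVMPointer(kernel, 0, data);

// (enqueue the kernel as usual)
clSVMFree(context, data);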

I could write many articles about the subjects above, but I'll leave that for later. This article won't get into the technical details, but rather into what's available from the vendors. So let's see what toys we were given!

A note: don't start with OpenCL 2.0 directly if you don't know the basic concepts of OpenCL. Continue reading “Overview of OpenCL 2.0 hardware support, samples, blogs and drivers”

Kernels and the GPL. Are we safe and linking?

Disclaimer: I am not a lawyer and below is my humble opinion only. The post is for insights only, not for legal matters.

The GPL has always been a protection against somebody or some company running away with your code and making money with it. Or at least it forces improvements to get back into the community. For unprepared companies this was quite stressful when they were forced to give their software away. Now we have host-kernel languages such as OpenCL, CUDA, DirectCompute and RenderScript, which don't really link a kernel, but load it and launch it. As the GPL is quite complicated when it comes to mixing with commercial code, I want to give a warning that the GPL might not be prepared for this.

If your software is dual-licensed, you cannot assume the GPL is not chosen when eventually used in commercial software. Read below why not.

I hope we can have a discussion here, so we get to the bottom of this.

Continue reading “Kernels and the GPL. Are we safe and linking?”

PDFs of Monday 12 September

As sharing my readings got more popular, I decided to put them on my site. I focus on everything that uses vector-processing (GPUs, heterogeneous computing, CUDA, OpenCL, GPGPU, etc.). Did I miss something, or do you have a story you want to share? Contact me or comment on this article. If you tell others about the projects you discovered here, I would appreciate it if you mention my website or my Twitter handle @StreamHPC.

The research papers have their authors mentioned; the other links can be presentations or overviews of (mostly) products. I have read all of them, except the long PhD theses (which are on my non-ad-hoc reading list) – drop me any question you have.

Bullet Physics, Autodesk style. AMD and Autodesk on integrating Bullet Physics engine into Maya.

MERCUDA: Real-time GPU-based marine scene simulation. OpenCL has enabled more realistic sea and sky simulation for this product, see page 7.

J.P.Morgan: Using Graphic Processing Units (GPUs) in Pricing and Risk. Two pages describing how OpenCL/CUDA can give a 10 to 100 times speedup over conventional methods.

Parallelization of the Generalized Hough Transform on GPU (Juan Gómez-Luna, José María González-Linares, José Ignacio Benavides, Emilio L. Zapata and Nicolás Guil). Describing two parallel methods for the Fast Generalized Hough Transform (Fast GHT) using GPUs, implemented in CUDA. It studies how load balancing and occupancy impact the performance of an application on a GPU. An interesting article, as it shows that you can choose which limits you bump into.

Performance Characterization and Optimization of Atomic Operations on AMD GPUs (Marwa Elteir, Heshan Lin and Wu-chun Feng). Measurement of the impact of using atomic operations on AMD GPUs. It seems that even mentioning ‘atomic' puts the kernel in atomic mode and has a major influence on performance. They also come up with a solution: software-based atomic operations. Work in progress.

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing (Mayank Daga, Ashwin M. Aji, and Wu-chun Feng). Another one from Virginia Tech, this time on AMD's APUs. This article measures the APU's performance via a set of micro-benchmarks (e.g., PCIe data transfer), kernel benchmarks (e.g., reduction) and actual applications (e.g., molecular dynamics). Very interesting to see in which cases discrete GPUs are at a disadvantage even with more muscle power.

A New Approach to rCUDA (José Duato, Antonio J. Peña, Federico Silla, Juan C. Fernández, Rafael Mayo, and Enrique S. Quintana-Ortí). On (remote) execution of CUDA software within VMs. Interesting if you want powerful machines in your company to delegate heavy work to, or if you are interested in clouds.

Parallel Smoothers for Matrix-based Multigrid Methods on Unstructured Meshes Using Multicore CPUs and GPUs (Vincent Heuveline, Dimitar Lukarski, Nico Trost and Jan-Philipp Weiss). Different methods around 8 multi-colored Gauß-Seidel type smoothers using OpenMP and GPUs. Also some words on scalability!

Visualization assisted by parallel processing (B. Lange, H. Rey, X. Vasques, W. Puech and N. Rodriguez). How to use GPGPU for visualising big data. An important factor of real-time data-processing is that people get more insight into the matter. As an example they use temperatures in a server room. As I see more often now, they benchmark CPUs, GPUs and hybrids.

A New Tool for Classification of Satellite Images Available from Google Maps: Efficient Implementation in Graphics Processing Units (Sergio Bernabé and Antonio Plaza). A 30 times speed-up with a new parallel implementation of the k-means unsupervised clustering algorithm in CUDA. It is used for classification of satellite images.

TAU Performance System. Product presentation of TAU, which does, among other things, parallel profiling and tracing. Supports CUDA and OpenCL. An extensive collection of tools, so worth spending time on. A paper released in March describes TAU and compares it with two other performance measurement systems: PAPI and VampirTrace.

An Experimental Approach to Performance Measurement of Heterogeneous Parallel Applications using CUDA (Allen D. Malony, Scott Biersdorff, Wyatt Spear and Shangkar Mayanglambam). Using TAUcuda, a TAU-based tool (see above), this paper describes what to focus on when optimising heterogeneous systems.

Speeding up the MATLAB complex networks package using graphic processors (Zhang Bai-Da, Wu Jun-Jie, Tang Yu-Hua and Li Xin). Free registration required. Their conclusion: “In a word, the combination of GPU hardware and MATLAB software with Jacket Toolbox enables high-performance solutions in normal server”. Another PDF I found was: Parallel High Performance Computing with emphasis on Jacket based computing.

Profile-driven Parallelisation of Sequential Programs (Georgios Tournavitis). PhD-thesis on a new approach for extracting and exploiting multiple forms of coarse-grain parallelism from sequential applications written in C.

OpenCL, Heterogeneous Computing, and the CPU. Presentation by Tim Mattson of Intel on how to use OpenCL with the vector-extensions of Intel-processors.

MMU Simulation in Hardware Simulator Based-on State Transition Models (Zhang Xiuping, Yang Guowu and Zheng Desheng). It seems a bit off the charts to have a paper on the Memory Management Unit of an ARM, but as the ARM processor gets more important, some insight into its memory system is valuable.

Multi-Cluster Performance Impact on the Multiple-Job Co-Allocation Scheduling (Héctor Blanco, Eloi Gabaldón, Fernando Guirado and Josep Lluí Lérida). This research-group has developed a scheduling-technique, and in this paper they discuss in which situations theirs works better than existing techniques.

Convey Computers: Putting Personality Into High Performance Computing. Product presentation. They combine x86 CPUs with pre-programmed FPGAs to get high throughput. In short: if you make heavy use of the provided algorithms, then this might be an alternative to GPGPU.

High-Performance and High-Throughput Computing. What it means for you and your research. Presentation by Philip Chan of Monash University. Though the target group is their own university, it gives nice insight into how things work at other universities and research groups. HPC is getting cheaper and accepted in more and more types of research.

Bull: Porting seismic software to the GPU. Presentation for oil companies on finding new oil fields. These seismic calculations are quite computation-intensive, and therefore portable HPC is needed. Note that StreamHPC is also assisting in porting such code to GPUs.

Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems (Shuai Che, Jeremy W. Sheaffer and Kevin Skadron). This piece of software allows CUDA-programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms.

Real-time volumetric shadows for dynamic rendering (MSc thesis of Alexandru Teodor V.L. Voicu). Self-shadowing using the Opacity Shadow Maps algorithm is not fit for real-time processing. This thesis discusses Bounding Opacity Maps, a novel method to overcome this problem. Including code at the end, which you can download here.

Accelerating Foreign-Key Joins using Asymmetric Memory Channels (Holger Pirk, Stefan Manegold and Martin Kersten). Shows how to accelerate foreign-key joins by executing the random table lookups in the GPU's VRAM while sequentially streaming the foreign-key index through the PCIe bus. Very interesting for how to make clever use of I/O bounds.

Come back next Monday for more interesting research papers and product presentations. If you have questions, don’t hesitate to contact StreamHPC.

Speeding up your data-processing

Using unused processing power

The computer as we know it has changed a lot over the past years. For instance, we can now use the graphics card for non-graphics purposes. This has resulted in a computer with a much higher potential. A doubling of processing speed, or more, is the rule rather than the exception. Using this unused extra speed gives a huge advantage to software that makes use of it – and that explains the growing popularity.

The acceleration technique is called OpenCL. It not only works on graphics cards from AMD and NVidia, but also on the latest processors from Intel and AMD, and even on processors in smartphones and tablets. Special processors such as DSPs and FPGAs will get support too. As it is an open standard, the support will only grow.

Offered services

StreamHPC has been active since June 2010 as an acceleration specialist and offers the following services:

[list2]

  • development of extreme fast software,
  • design of (faster) algorithms,
  • accelerating existing software, and
  • provide training in OpenCL and acceleration-techniques.

[/list2]

Not many companies master this specialism, and StreamHPC enjoys worldwide awareness on top of that. To provide support for large projects, collaborations with other companies have been established.

The preferred way of working is a low hourly rate plus agreed bonuses for speed-ups.

Target markets

The markets we operate in are bio-informatics, finance, audio and video, high-tech R&D, energy, mobile apps and other industries that target more performance per Watt or more performance per second.

WBSO

What we offer suits WBSO projects well (in the Netherlands only). This means that a large part of the costs can be subsidised. Together we can promote new technologies in the Netherlands, which is the goal of this subsidy.

Contact

Call Vincent Hindriksen MSc at +31 6 45400456 or mail to vincent@StreamHPC.nl with all your questions, or request a free demo.

Download the brochure for more information.

Why this new AMD FirePro Cluster is important for OpenCL

FirePro cluster
FirePro S9150 cluster

Then it hit the doormat:

“AMD is proud to collaborate with ASUS, the Frankfurt Institute for Advanced Studies (FIAS) and GSI to support such important physics and computer science research,” said David Cummings, senior director and general manager, professional graphics, AMD. “This installation reaffirms AMD's leading role in HPC with the implementation of the AMD FirePro S9150 server GPUs in this three petaFLOPS supercomputer cluster. AMD and ASUS are enabling OpenCL applications for critical science research usage for this cluster. We're committed to building our HPC leadership position in the industry as a foremost provider of computing applications, tools and technologies.”

You can read more here and the official news here.

Why is this important?

Is it that there are more FLOPS for the same price, as AMD hardware is cheaper? Nice, but secondary.

That it runs OpenCL? We like that, but from a broader perspective this is not the most important reason.

It is important because it creates more diversity in the world of HPC. Currently there are a few XeonPhi clusters and only one big AMD FirePro S10000 cluster. The rest is NVidia Tesla or CPU-only. With more AMD clusters, the HPC market is democratised. That means more software will be written in vendor-neutral languages like OpenCL (with high-level software/libraries on top), and prices of HPC accelerators will not be kept high.

How to further democratise the HPC world?

We started with porting Gromacs to OpenCL, and we will continue to port large projects to OpenCL. This software will simply run on XeonPhi, Tesla and FirePro with just a little porting time, reducing costs in many ways. We cannot do it alone, but together we can. Start by telling us which software needs to be ported from OpenMP to OpenCL or OpenMP 4, or from CUDA to OpenCL. And if you are porting open-source software to OpenCL, drop us a line for free advice and help with testing the software.

And the best thing you can do to break the monopoly of CUDA is to simply buy AMD or Intel hardware. The price difference is enough to buy lots of extra FLOPS and to pay for a complete project porting a large application to OpenCL.

 

How to install OpenCL on Windows

Getting your Windows machine ready for OpenCL is rather straightforward. In short, you only need the latest drivers for your OpenCL device(s) and you're ready to go. Of course, you will need to add an OpenCL SDK in case you want to develop OpenCL applications, but that's equally easy.

Before we start, a few notes:

  • The steps described herein have been tested on Windows 8.1 only, but should also apply for Windows 7 and Windows 8.
  • We will not discuss how to write an actual OpenCL program or kernel, but focus on how to get everything installed and ready for OpenCL on a Windows machine. This is because writing efficient OpenCL kernels is almost entirely OS independent.

If you want to know more about OpenCL and you are looking for simple examples to get started, check the Tutorials section on this webpage.

Running an OpenCL application

If you only need to run an OpenCL application without getting into development stuff then most probably everything already works.

If OpenCL applications fail to launch, then you need to have a closer look at the drivers and hardware installed on your machine:

GPU Caps Viewer

  • Check that you have a device that supports OpenCL. All graphics cards and CPUs from 2011 and later support OpenCL. If your computer is from 2010 or before, check this page. You can also find a list of OpenCL conformant products on the Khronos webpage.
  • Make sure your OpenCL device driver is up to date, especially if you’re not using the latest and greatest hardware. With certain older devices OpenCL support wasn’t initially included in the drivers.

Here is where you can download drivers manually:

  • Intel has hidden them a bit, but you can find them here with support for OpenCL 2.0.
  • AMD’s GPU-drivers include the OpenCL-drivers for CPUs, APUs and GPUs, version 2.0.
  • NVIDIA's GPU-drivers mention mostly CUDA, but the drivers for OpenCL 1.1/1.2 are there too.

In addition, it is always a good idea to check for any other special requirements that the OpenCL application may have. Look for device type and OpenCL version in particular. For example, the application may run only on OpenCL CPUs, or conversely, on OpenCL GPUs. Or it may require a certain OpenCL version that your device does not support.

A great tool that will allow you to retrieve the details of the OpenCL devices in your system is GPU Caps Viewer.
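If you prefer to retrieve those details from code instead, a minimal sketch looks like this (fixed-size arrays and no error handling, to keep it short):

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint numPlatforms = 0;
    clGetPlatformIDs(8, platforms, &numPlatforms);
    if (numPlatforms > 8) numPlatforms = 8;  // only 8 entries were filled

    for (cl_uint p = 0; p < numPlatforms; p++) {
        cl_device_id devices[8];
        cl_uint numDevices = 0;
        // CL_DEVICE_TYPE_ALL lists CPUs, GPUs and accelerators alike.
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &numDevices);
        if (numDevices > 8) numDevices = 8;

        for (cl_uint d = 0; d < numDevices; d++) {
            char name[256], version[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_VERSION, sizeof(version), version, NULL);
            printf("%s - %s\n", name, version);
        }
    }
    return 0;
}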

Developing OpenCL applications

Now it’s time to put the pedal to the metal and start developing some proper OpenCL applications.

The basic steps would be the following:

  • Make sure you have a machine which supports OpenCL, as described above.
  • Get the OpenCL headers and libraries included in the OpenCL SDK from your favourite vendor.
  • Start writing OpenCL code. That’s the difficult part.
  • Tell the compiler where the OpenCL headers are located.
  • Tell the linker where to find the OpenCL .lib files.
  • Build the fabulous application.
  • Run and prepare to be awed in amazement.

Ok, so let’s have a look into each of these.

OpenCL SDKs

For OpenCL headers and libraries the main options you can choose from are:

  • NVIDIA CUDA Toolkit
  • AMD APP SDK
  • Intel SDK for OpenCL Applications

As long as you pay attention to the OpenCL version and the OpenCL features supported by your device, you can use the OpenCL headers and libraries from any of these three vendors.

OpenCL headers

Let’s assume that we are developing a 64bit C/C++ application using Visual Studio 2013. To begin with, we need to check how many OpenCL platforms are available in the system:

[raw]

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_int err;
    cl_uint numPlatforms;

    // Passing 0 and NULL means: only query the number of platforms.
    err = clGetPlatformIDs(0, NULL, &numPlatforms);
    if (CL_SUCCESS == err)
        printf("\nDetected OpenCL platforms: %d", numPlatforms);
    else
        printf("\nError calling clGetPlatformIDs. Error code: %d", err);

    return 0;
}

[/raw]

We need to specify where the OpenCL headers are located by adding the path to the OpenCL “CL” folder to the Additional Include Directories. If you use the NVIDIA CUDA Toolkit, the “CL” folder is in the same location as the other CUDA include files, that is, CUDA_INC_PATH. On a x64 Windows 8.1 machine with CUDA 6.5, the environment variable CUDA_INC_PATH is defined as “C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include”.

If you’re using the AMD SDK, you need to replace “$(CUDA_INC_PATH)” with “$(AMDAPPSDKROOT)/include” or, for Intel SDK, with “$(INTELOCLSDKROOT)/include“.

OpenCLNVIDIA_AdditionalInclude

OpenCL libraries

Similarly, we need to let the linker know about the OpenCL libraries. Firstly, add OpenCL.lib to the list of Additional Dependencies:

OpenCLNVIDIA_AdditionalDependencies

Secondly, specify the OpenCL.lib location in Additional Library Directories:

OpenCLNVIDIA_AdditionalLibrary

As in the case of the includes, if you're using the AMD SDK, replace “$(CUDA_LIB_PATH)” with “$(AMDAPPSDKROOT)/lib/x86_64”, or in the case of Intel with “$(INTELOCLSDKROOT)/lib/x64”.

And you're good to go! The application should now build and run. Now, how difficult was that? Happy OpenCL coding on Windows!

If you have any question or suggestion, just leave a comment.