Apple Metal versus Vulkan + OpenCL 2.2

Metal – Apple’s me-too™ language

Update: C++ support has been moved from OpenCL 2.1 to 2.2, which was released in 2017. Title and text have been updated to reflect this. A reason why Apple released Metal might be that Khronos was too slow in bringing C++ kernels to OpenCL, given the delays.

Apple Metal in one sentence: one queue for both compute and graphics, using C++11. They have now brought it to OSX. The detail they don’t tell: that’s exactly what the combination of Vulkan + OpenCL 2.2 does. Instead it is compared with OpenCL 1.x + OpenGL 4.x, which it certainly can compete with, as that combination has neither C++11 kernels nor a single queue.

Apple Metal on OSX – a little too late, bringing nothing new to the stage compared to SPIR-V and OpenCL 2.2.

The main reason why they can’t compete with the standards is the urge to create high-level languages and DSLs on top of lower-level languages. What Apple did was create just one and leave out the rest. This means that languages like SYCL and C++AMP (implemented on top of SPIR-V) can’t simply run on OSX, which blocks new innovation. To understand why SPIR-V is so important and why Apple should take that road, read this article on SPIR-V.

Metal could compile to SPIR-V, just like OpenCL-C++ does. The rest of the Metal API is just like Vulkan.

Yet another vendor lock-in?

Now that Khronos is switching its two most important APIs to the next level, there is a short-term void. This is clearly the right moment for Apple to take the risk and try to get developers interested in their new language. If they succeed, then we get the well-known “pffff, I have no time to port it to other platforms” and there is a win for Apple’s platforms (they hope).

Apple has always wanted to have a different way of interacting with OpenCL-kernels, using Grand Central Dispatch. Porting OpenCL between Linux and Windows is a breeze, but porting from or to OSX is not. Discussions over the past years with many people from the industry taught me one thing: Apple is like Google, Microsoft and NVidia – they don’t really want standards, they want 100% dedicated developers for their languages.

Yes, now also Apple is on the list of Me-too™ languages for OpenCL. We at StreamHPC can easily translate your code from and to Metal, but we would rather you put your investments into more important matters, like improving the algorithms and performance.

Still OpenCL support on OSX?

Yes, but only OpenCL 1.2. A workaround is to use SPIR-to-Metal translators and a wrapper from Vulkan to Metal – this will not be very convenient though. The way to go is that everybody starts asking for OpenCL 2.0 support on OSX forums. Metal is a great API, but that doesn’t change the fact that it’s obstructing standardisation of likewise great, open standards. If they provide both Metal and Vulkan + OpenCL 2.2, then I am happy – then the developers have the choice.

Metal debuts in “OSX El Capitan”, which is available to developers as of today, and to the general public this fall.

CUDA 6 Unified Memory explained

A) Unified Memory Access (UMA). B) NVIDIA’s Unified Virtual Addressing (UVA), now rebranded as “Unified Memory”.

AMD, ARM vendors and Intel have been busy unifying CPU and GPU memories for years. It is not easy to design a model where two (or more) processors can access memory without deadlocking each other.

NVIDIA just announced CUDA 6 and, to my surprise, it includes “Unified Memory”. Am I missing something completely, or did they just pass their competitors, as the name implies one memory? The answer is in their definition:

Unified Memory — Simplifies programming by enabling applications to access CPU and GPU memory without the need to manually copy data from one to the other, and makes it easier to add support for GPU acceleration in a wide range of programming languages.

The official definition is:

Unified Memory Access (UMA) is a shared memory architecture used in parallel computers. All the processors in the UMA model share the physical memory uniformly. In a UMA architecture, access time to a memory location is independent of which processor makes the request or which memory chip contains the transferred data.

See the difference?

The image at the right explains it differently: A) is how UMA is officially defined, and B) is how NVIDIA has redefined it.

So NVIDIA’s Unified Memory solution is engineered by marketeers, not by hardware engineers. On Twitter I am apparently not the only one who felt the need to explain that it differs from the terminology the other hardware designers have been using.

So if it is not unified memory, what is it?

It is intelligent synchronisation between CPU and GPU-memory. The real question is what the difference is between Unified Virtual Addressing (UVA, introduced in CUDA 4) and this new thing.


UVA defines a single address space, where CUDA takes care of the synchronisation when the addresses are physically not in the same memory. The developer has to give ownership to either the CPU or the GPU, so CUDA knows when to sync memories. It does need cudaDeviceSynchronize() to trigger that synchronisation.

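To make the difference concrete, here is a minimal CUDA 6 sketch (illustrative only – the kernel and sizes are made up, not taken from NVIDIA’s material). Note the synchronisation point is still there; it just replaces the explicit copies:

#include <cuda_runtime.h>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main(void) {
    const int n = 1024;
    int *data;
    cudaMallocManaged(&data, n * sizeof(int));  // one pointer, usable on CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = i;    // the CPU writes directly
    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                    // required before the CPU touches the data again
    // No cudaMemcpy anywhere: the runtime migrates the data behind the scenes.
    cudaFree(data);
    return 0;
}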

From AnandTech, which wrote about Unified (virtual) Memory:

This in turn is intended to make CUDA programming more accessible to wider audiences that may not have been interested in doing their own memory management, or even just freeing up existing CUDA developers from having to do it in the future, speeding up code development.

So it’s there to attract new developers, and then later take care of them being bad programmers? I cannot agree, even if it makes GPU-programming popular – I don’t bike on highways.

From Phoronix, which discussed the changes of NVIDIA Linux driver 331.17:

The new NVIDIA Unified Kernel Memory module is a new kernel module for a Unified Memory feature to be exposed by an upcoming release of NVIDIA’s CUDA. The new module is nvidia-uvm.ko and will allow for a unified memory space between the GPU and system RAM.

So it is UVM 2.0, but without any API changes. That’s clear then. It simply matters a lot whether memory is truly unified or only virtually so, and I really don’t understand why NVIDIA chose to obfuscate these matters.

In OpenCL this has to be done explicitly, by mapping and unmapping pinned memory, but it is very comparable to what UVM does. I do think UVM is a cleaner API.
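For comparison, a minimal OpenCL sketch of that explicit map/unmap cycle (assuming a valid context and queue; error handling omitted):

cl_int err;
cl_mem buf = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR,
                            1024 * sizeof(float), NULL, &err);
// Map: hand ownership to the host, so it can read/write the memory directly.
float *ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                         0, 1024 * sizeof(float),
                                         0, NULL, NULL, &err);
for (int i = 0; i < 1024; ++i)
    ptr[i] = (float)i;
// Unmap: hand ownership back to the device before kernels use the buffer.
clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);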

Let me know what you think. If you have additional information, I’m happy to add this.

Dear Linux-users, during the transition period for FGLRX to AMDGPU/ROCm there’s no kernel 4.4 or Xorg 1.18 support

The information you find everywhere: on Linux the current “radeon” and “fglrx” drivers are being replaced by AMDGPU (graphics) and ROCm (compute) for HSA-enabled GPUs. As the whole AMD Linux driver team is seemingly working on getting the new, open-source drivers ready, fglrx is now deprecated and will not get updates (or only very late ones). I can therefore get straight to the point:

When using fglrx on Linux, don’t upgrade to Linux distributions with a kernel later than 4.2 or Xorg server versions beyond 1.17!

For Ubuntu this means no 14.04.5 or 16.04 or later. When you have 14.04.4, the kernel will not upgrade when you go to 14.04.5. CentOS/RedHat ships such old kernels that there currently is no issue. Fedora users simply have a problem, as they are already moving towards kernel 4.8.

Continue reading “Dear Linux-users, during the transition period for FGLRX to AMDGPU/ROCm there’s no kernel 4.4 or Xorg 1.18 support”

Stream Team at ISC

This year we’ll be at ISC with four people: Vincent, Adel, Anna and Istvan. You can find us at booth G-812, next to Red Hat.

Booth G-812 is manned&womened by Stream HPC

While we became known in the HPC world for our OpenCL expertise, we now have many years of experience in CUDA and OpenMP as well. To get there, we’ve focused a lot on how to improve the code quality of existing software, to reduce bugs and increase speedup potential. Our main expertise remains full control over algorithms in software – the same data, simply processed faster.

Why do we have a booth?

We’ll mostly be talking to (new) customers about developing high-performance software for the big machines. We’ll also have a list of our open job positions with us, and we can do a first introductory interview on the spot.

Our slogan for this year is:

There are a lot of supercomputers. Somebody has to program their software

We’ll be sharing our week on Twitter, so you can also see what we find: posters about HPC-programming on CPU and GPU, booths that have nice demos or interesting talks and of course the surprises.

Let’s meet!

If you don’t have an appointment yet, but would like to chat with us, please contact us or drop by at our booth. As we’re with four people, we have high flexibility.

Big Data

Big data is a term for data so large or complex that traditional processing applications are inadequate. Challenges include:

  • capture, data-curation & data-management,
  • analysis, search & querying,
  • sharing, storage & transfer,
  • visualization, and
  • information privacy.

The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.

At StreamHPC we’re focused on optimizing (predictive) analytics and data-handling software, as these tend to be slow. We have solved Big Data problems in two areas: real-time pre-processing (filtering, structuring, etc.) and analytics (including in-memory search on a GPU).

OpenCL in simple words

Our business is largely around making software faster. For that we use OpenCL, but do you know what this programming language is? Why can’t this speeding-up be done using other languages like Java, C#, C++ or Python?

OpenCL is the answer to high-level languages, which promised us superfast software that was very quick to write. After 20 years that was still a promise, as compilers had to guess too much about what was intended. OpenCL gives the programmer more control in the places where control is needed to get high-performing code, and leaves fewer guesses to the compiler.

It’s C with some extra power

It’s like normal C with three extra concepts, all with the aim of making the software run faster.

Explicit Data Transfer

In other introductions to OpenCL the data transfers are mentioned as one of the last parts, but I find them the most important one. Reason: in most cases they are the main bottleneck in performance-targeted code.

When moving your stuff to another house, you pack all in boxes first before loading the truck. Or would you load each item into the truck one-by-one? Transport-costs would be much higher that way.

While it would be great if the fastest data transfers were done automatically, it simply doesn’t work like that. This means that designing the data transfers is an important task when making fast software. OpenCL lets you do this.
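As a small illustration (a sketch, assuming a valid queue, buffer and host_data – not from the original article): this is the host-side call where you decide when, and how much, data moves.

clEnqueueWriteBuffer(queue, buffer, CL_TRUE /* blocking */, 0,
                     1024 * sizeof(float), host_data, 0, NULL, NULL);
// One explicit, batched transfer – the packed moving box, not item-by-item.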

Multiple cores

Most people have heard of “cores”, as made famous by Intel. Each core can do part of a computation and effectively reduce the runtime. OpenCL implements this by isolating the code that runs on each core – what goes in and out of that isolated code is made explicit. This way the code is really easy to scale up to thousands of cores.

Would you choose the best-in-class to write the multiplication tables from 1 to 20, or have each student write one of them? Even though the slowest student will limit the rest, the total time is still lower.

Where a normal processor has 1, 2, 4 or 8 cores, a graphics processor has hundreds or even thousands of cores. OpenCL-software works on both.
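As a small illustrative sketch (not from the original article): each work-item below only sees its own element, and everything that goes in and out passes through the argument list.

__kernel void add(__global const float *a,
                  __global const float *b,
                  __global float *c) {
    int i = get_global_id(0);  // each core/work-item handles its own element
    c[i] = a[i] + b[i];
}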

Vectors

Modern processors can do computations on more than one data item at the same time, using what can be described as sub-cores. This means that each core has parallelism of its own.

When reading, do you read one word at a time or character by character? Your brain can parse multiple characters at the same time.

OpenCL has support for “vectors” (bundles of alike data) to be able to program these sub-cores.
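A tiny illustrative example in OpenCL C:

float4 a = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
float4 b = (float4)(10.0f, 20.0f, 30.0f, 40.0f);
float4 c = a + b;  // one operation, four additions at once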

It runs on many types of devices

OpenCL is famous for being the standard programming model for a lot of modern processors. There is no other programming language that can do the same. Support is available on:

  • CPUs; standard processors by Intel, AMD and ARM
  • GPUs; graphics cards by Intel, AMD and NVIDIA
  • FPGAs; processors that are programmed on the hardware-level, by Altera and Xilinx.
  • DSPs; digital signal processors by TI
  • Mobile graphics processors by ARM, Imagination, Qualcomm, etc.
  • See the rest of the list here.

This means that code can be ported to new devices in days or weeks instead of having to rewrite everything from scratch.

How does translating to OpenCL work?

When software needs to be faster, the first step is to find out its bottlenecks – these “hot spots” will be ported to OpenCL, while the rest remains the same. Then comes the hardest part: changing the algorithms such that data-transfers are more efficient and all cores are used. The last step is to look into low-level optimisations like the vectors.

Above is a very simplified representation of OpenCL. Still, you’ve seen that the language is unique and powerful. That uniqueness will change, as its concepts slowly get embedded into existing languages – till then, OpenCL is the only standard that fully enables all hardware features.

Let us do your peer-review

There are many research papers that claim enormous speed-ups using an accelerator. From our experience a large part of that is due to code modernisation (parallelisation & optimisation), which makes the claim look false. That’s why we offer peer-reviews of CUDA and OpenCL software for half our rate. The final costs depend on the size and complexity of the code.

We will profile your CPU and accelerator code on our machines and review the code. The results separate the effect of the code modernisations from the effect of using the accelerator (GPU, XeonPhi, FPGA). With this we hope to stimulate that code modernisation gets more research attention than “miracle hardware” does.

Don’t misunderstand: GPUs can still get an average of 8x speedup (or 700% speed improvement) over optimised code, which is still huge! But it’s simply not the 30-100x speed-up claimed in the slide at the right.

 

Basic concepts: malloc in the kernel

Pointers and allocated memory space with a hint to Oktoberfest.

During the last training I got a question on how to do a malloc in the kernel. It was one of those good questions, as it gives another view on a basic concept of OpenCL. Simply put: you cannot allocate (local or global) memory from within the kernel. Luckily there is a way, but it is somewhat hidden in another function.

clSetKernelArg to the rescue

The way to do it is from the host, using one of the kernel arguments.

cl_int clSetKernelArg ( cl_kernel kernel,
cl_uint arg_index,
size_t arg_size,
const void *arg_value)

This function allocates the memory on the device for you. Just as with normal malloc, it doesn’t clear the memory for you.

To make sure the host cannot access it (and you don’t accidentally pin/write/read it when using host-generation scripts), you can pass a flag when creating the buffer: CL_MEM_HOST_NO_ACCESS. All the flags have been explained in a previous article about setting flags for creating kernel arguments.
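A minimal sketch (assuming a valid context and kernel; error handling omitted, sizes chosen for illustration):

cl_int err;
cl_mem scratch = clCreateBuffer(context,
                                CL_MEM_READ_WRITE | CL_MEM_HOST_NO_ACCESS,
                                1024 * sizeof(cl_int), NULL, &err);
// The device-only "malloc": the host never touches this buffer.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &scratch);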

The advantage of only allowing malloc to be done from the host, before the kernel is launched, is that the memory-planning can be done more efficiently.

Local memories

When you need local memory space, you can specify that on the kernel side. For example:

__kernel void foo(__local int* bar) { ... }

This allocates an area in each work-group’s local memory, with the size specified by arg_size.
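On the host side that looks like this (a sketch; 256 is an arbitrary size):

clSetKernelArg(kernel, 0, 256 * sizeof(cl_int), NULL);
// NULL arg_value: nothing is copied – 256 ints of local memory are
// reserved for parameter "bar" in each work-group.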

Basic Concepts

This short article is part of the basic-concepts series, covering subjects I did not see explained well enough in books or the reference manual. If you see a subject that you would like to see in this series, just contact us.

Faster Development Cycles for FPGAs

The time-difference between the normal and OpenCL flow is large. The final product is as fast and efficient.

VHDL and Verilog are not the right tools when it comes to developing for FPGAs fast.

  • It is time-consuming. If the first cycle takes 3 months, then each subsequent cycle easily takes 2 weeks. Time is money.
  • Porting or upgrading a design from one FPGA device to another is also time-consuming. This makes it essential to choose the final FPGA vendor and family upfront.
  • Dual-platform development on CPU and FPGA needs synchronisation. The code works on either the CPU or the FPGA, which makes the functional tests made for the CPU-version less trustworthy.

Here is where OpenCL comes in.

  • Shorter development cycles. Programming in OpenCL is normally much faster than in VHDL or Verilog. If you are porting C/C++ code onto FPGA the development cycles will be dramatically shorter. Think weeks instead of months – as this news article explains. This means a radically reduced investment as well as providing time for architectural exploration.
  • OpenCL works on both CPUs and FPGAs, so functional tests can be run on either. As a bonus the code can be optimised for GPUs, within a short time-frame.
  • The performance is equal to VHDL and Verilog, unless FPGA-specific optimisations are used, such as vector-widths not equal to a power of two.
  • Vendor Agnostic solution. Porting to other FPGAs takes considerably less time and the compiler solves this problem for you.
  • Both Xilinx and Altera have OpenCL compilers. Altera was the first to come out with an OpenCL offering and has a full SDK, which is an add-on to Quartus II. Xilinx also has a stand-alone OpenCL development environment called SDAccel.

Support for OpenCL is strong at both Altera and Xilinx

Both vendors suggest OpenCL to overcome existing FPGA design problems. Altera suggests using OpenCL to speed up the process for existing developers. So OpenCL is not a third-party tool that you need to trust separately.

OpenCL allows a user to abstract away the traditional hardware FPGA development flow for a much faster and higher level software development flow – Altera

Xilinx suggests that OpenCL can enable companies without the needed developer resources to start working with FPGAs.

Teams with limited or no FPGA hardware resources, however, have found the transition to FPGAs challenging due to the RTL (VHDL or Verilog) development expertise needed to take full advantage of these devices. OpenCL eases this programming burden – Xilinx

Why choose StreamHPC?

There are several reasons to let us do the porting and prototyping of your product.

  • We have the right background, as our team consists of CPU, GPU and FPGA developers. Our code is therefore designed with easy porting in mind.
  • Our costs are lower than having the product done in Verilog/VHDL.
  • We give guarantees and support for our products on all platforms the product is ported on.
  • We can port the final OpenCL code to Verilog/VHDL, keeping the same performance. In case you don’t trust a high-level language, we have you covered.
  • Optionally you can get both the code and a technical report with a detailed explanation of how we did it. So you can learn from this and modify the code yourself.
  • You get free advice on when (and not) to use OpenCL for FPGAs.

There are three ways to get in contact quickly:

call: +31 854865760 (European office hours)

e-mail: contact@streamhpc.com

Fill in this form – mention when you want to be called back (possible outside normal office hours):


Want to read more?

We have written about OpenCL-on-FPGAs on our blog in previous years.

Promotion for OpenCL Training (’12 Q4 – ’13 Q2)

So you want your software to be much faster than the competition?

In 4 days your software team learns all techniques to make extremely fast software.

Your team will learn how to write optimal code for GPUs and make better use of the existing hardware. They will be able to write faster code immediately after the training – doubling the speed is the minimum, 100 times is possible. Your customers will notice the difference in speed.

We use advanced, popular techniques like OpenCL and older techniques like cache-flow optimisation. At the end of the training you’ll receive a certificate from StreamHPC.

Want more information? Contact us.

About the training

Location and Time

OpenCL is a rather new subject, and hard-coding the location and time has not proven successful in past years for trainers in this subject. Therefore we chose flexible dates and initially offer the training in large/capital cities and technology centres world-wide.

A final date for a city will be picked once there are 5 to 8 attendees, with a maximum of 12. You can specify your preferences for cities and dates in the form below.

Some discounts are available for developing countries.

Agenda

Day 1: Introduction

Learn about GPU architectures and AVX/SSE, how to program them and why they are faster.

  • Introduction to parallel programming and GPU-programming
  • An overview of parallel architectures
  • The OpenCL model: host-programming and kernel-programming
  • Comparison with NVIDIA’s CUDA and Intel’s Array Building Blocks.
  • Data-parallel and task-parallel programming
Lab-session will be an image-filter.
Note: since CUDA is very similar to OpenCL, you are free to choose to do the lab-sessions in CUDA.

Day 2: Tools and advanced subjects

Learn about parallel-programming tactics, host-programming (transferring data), IDEs and tools.

  • Static kernel analysis
  • Profiling
  • Debugging
  • Data handling and preparation
  • Theoretical backgrounds for faster code
  • Cache flow optimisation
Lab-session: yesterday’s image-filters using a video-stream from a web-cam or file.

Day 3: Optimisation of memory and group-sizes

Learn the concept of “data-transport is expensive, computations are cheap”.
  • Register usage
  • Data-rearrangement
  • Local and private memory
  • Image/texture memory
  • Bank-conflicts
  • Coalescence
  • Prefetching
Lab-session: various small puzzles, which can be solved using the explained techniques.

Day 4: Optimisation of algorithms

Learn techniques to help the compiler make better and faster code.
  • Precision tinkering
  • Vectorisation
  • Manual loop-unrolling
  • Unbranching
Lab-session: like day 3, but now with compute-oriented problems.

Enrolment

When filling in this form, you declare that you intend to follow the course. Cancellation can be done via e-mail or phone at any time.

StreamHPC will keep you up-to-date for the training at your location(s). When the minimum of 5 attendees has been reached, a final date will be discussed. If you selected more locations, you have the option to wait for a training at another city.

Put any remarks you have in the message. If you have any question, mail to trainings@streamhpc.com.


Master+PhD students, applications for two PRACE summer activities open now

PRACE is organising two summer activities for Master+PhD students. Both activities are expense-paid programmes and will allow participants to travel and stay at a hosting location and learn about HPC:

  • The 2017 International Summer School on HPC Challenges in Computational Sciences
  • The PRACE Summer of HPC 2017 programme

The main objective of this programme is to enable HiPEAC member companies in Europe to have access to highly skilled and exceptionally motivated research talent. In turn, it offers PhD students from Europe a unique opportunity to experience the industrial research environment and to work on R&D projects solving real problems.

Below explains both programmes in detail. Continue reading “Master+PhD students, applications for two PRACE summer activities open now”

Learn about AMD’s PRNG library we developed: rocRAND – includes benchmarks

While CUDA kept its dominance over OpenCL, AMD introduced HIP – a programming language that closely resembles CUDA. Porting code to AMD hardware no longer takes months: more and more CUDA software converts to HIP without problems. Even really large and complex code-bases take a few weeks at most, and we found that the problems solved along the way also made the CUDA code run faster.

The only problem is that each CUDA library needs a HIP equivalent before all CUDA software can be ported.

Here is where we come in. We helped AMD make a high-performance Pseudo-Random Number Generator (PRNG) library, called rocRAND. Random number generation is important in many fields, from finance (Monte Carlo simulations) to cryptography, and from procedural generation in games to providing white noise. For some applications it’s enough to have just some data, but for large simulations the PRNG is the limiting factor. Continue reading “Learn about AMD’s PRNG library we developed: rocRAND – includes benchmarks”

NVIDIA beta-support for OpenCL 2.0 works on Linux too

In the release notes for the 378.66 graphics drivers for Windows (February 2017), NVIDIA officially spoke about supporting OpenCL 2.0 for the first time. Unfortunately, this is partial support only and, as NVIDIA says, these new [OpenCL 2.0] features are available for evaluation purposes only.

We did our own tests on a GTX 1080 on Windows and could confirm that for Windows the green team is halfway there. NVIDIA still has to implement pipes, enable non-uniform work-group sizes (these occur when the ND-range global_work_size is not divisible by the local_work_size), and fix a few bugs in device-side enqueue.
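To illustrate what a non-uniform work-group is (a hypothetical host-side call, not from our test suite):

size_t global_size = 1000;   // not divisible by the local size
size_t local_size  = 256;    // the last work-group would have only 232 work-items
// Legal for a kernel built with -cl-std=CL2.0; an OpenCL 1.2 device
// returns CL_INVALID_WORK_GROUP_SIZE instead.
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                       0, NULL, NULL);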

Today we decided to test out NVIDIA’s latest driver (378.13) for 64-bit Linux and check its support for OpenCL 2.0.

NVIDIA, OpenCL 2.0 and Linux

Just like on Windows, our GTX 1080 reports that it is an OpenCL 1.2 device. That is understandable, since support for OpenCL 2.0 is only in beta stage. Below you’ll find an overview of the 2.0 features supported by this Linux driver.

  • SVM: Yes. Only coarse-grained SVM is supported; fine-grained SVM (an optional feature) is not.
  • Device-side enqueue: Partially – surprisingly, it works better than on Windows. Almost all OpenCL programs with a device-side queue that we tested work. Some advanced examples with multi-level device-side kernel enqueuing and/or CLK_ENQUEUE_FLAGS_WAIT_WORK_GROUP fail. When using the device-side queue, it’s only possible to use a 1D ND-range with uniform work-groups (or without specifying a local size); 2D and 3D ND-ranges don’t work.
  • Work-group functions: Yes.
  • Pipes: No. Pipe functions are defined in libOpenCL.so in the 378.13 drivers, but using them causes run-time errors.
  • Generic address space: Yes.
  • Non-uniform work-groups: No.
  • C11 atomics: Partially. Using atomic_flag_* functions causes a CL_BUILD_ERROR.
  • Subgroups extension: No.

The host-side functions clSetKernelExecInfo(), clCreateSamplerWithProperties() and clCreateCommandQueueWithProperties() are also present and working.
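For instance, a device-side queue can be created with the new entry point like this (a minimal sketch, assuming a valid context and device; error handling omitted):

cl_queue_properties props[] = {
    CL_QUEUE_PROPERTIES,
    (cl_queue_properties)(CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                          CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT),
    0
};
cl_command_queue device_queue =
    clCreateCommandQueueWithProperties(context, device, props, &err);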

As you can see, the support for OpenCL 2.0 on Linux is almost exactly the same as on Windows. But in contrast with the Windows drivers, we were able to successfully compile and run several more kernels that use the device-side queue. This may indicate that the feature is being actively developed, and maybe in future drivers it will work much better – for both Linux and Windows.

What you can do to make it better

As NVIDIA only adds new functionality to the OpenCL driver when requested, it is very important that they receive these requests. So when you or your employer is a paying customer, do keep requesting the features you need. NVIDIA knows that lacking required functionality will be bad for their sales.

The 8 reasons why our customers had their code written or accelerated by us

Making software better and faster.

In the past six years we have helped various customers solve their software performance problems. While each project has been very different, there have been 8 reasons to hire us as performance engineers. These can be categorised into three groups:

  • Reduce processing time
    • Meeting timing requirements
    • Increasing user efficiency
    • Increasing the responsiveness
    • Reducing latency
  • Do more in the same time
    • Increasing simulation/data sizes
    • Adding extra functionality
  • Reduce operational costs
    • Reducing the server count
    • Reducing power usage

Let’s go into each of these. Continue reading “The 8 reasons why our customers had their code written or accelerated by us”

All OpenCL SDKs now in our Knowledge Base

For those who haven’t seen the latest addition to our knowledge base: we have added a list of (almost) all available OpenCL SDKs. You can find it in the menu under “Knowledge Base” -> “SDKs…“.

This list shows how important OpenCL is getting, as developers now can write compute-intensive parallel software on CPUs, GPUs, ARM-based accelerators and even FPGAs. This growth of OpenCL-devices is very exciting and important news, and that’s why it has got its own section on the site.

The current list is (in random order):

Currently looking into:

  • Intel Xeon Phi
  • Nintendo Wii U dev
  • Sony Playstation 4 Orbis
  • Vivante
  • Xilinx
  • NVidia GPUs
  • Qualcomm

The SDK of NVIDIA is on the second list, which you maybe did not expect. We have to wait until they put out their official statement on what they are going to do with CUDA and OpenCL.

While you are there, also check the other parts of the Knowledge Base:

  • What is… -> Explanations of terminology. Put your requests in a comment.
  • Event&Talks -> A list of events which StreamHPC attends, gives talks at and helps organise. Interesting for both managers and engineers.
  • Self Study – The part of the site most visited after the blog. This is for the engineers who want to start learning programming GPUs.

This section will be updated and extended continuously with information not available anywhere else.

StreamHPC has been in the OpenCL business since 2010, as one of the few. We have been the most visible and best-known OpenCL specialist ever since.

MediaTek’s partners deliver OpenCL on their phones

Several Chinese phones bring OpenCL to millions of users, as MediaTek offers their drivers to all phone vendors who use their (recent) chipsets.

MediaTek said that you just need a phone with one of the chipsets below to run your OpenCL app, as they provide the driver stack with the hardware to their customers. I’ve added a few phone names, but there is no guarantee the OpenCL drivers are actually there. So be on the safe side and don’t buy the cheapest phone, but a more respected China-brand. Contact us if you got a phone with one of these chipsets that doesn’t work – then I’ll contact MediaTek. Share your experience with the chipset in the comments.

In case you want to use the phone for actual use, be sure it supports your 4G frequencies. Also check this Gizchina article on the below chipsets. There are more MediaTek-chipsets that support OpenCL, but not openly – they prefer to focus on their latest 64-bit series.

Important note on conformance: Mediatek is an adopter and does conform for a few processors. Of the ones listed below, only MT6795 is certain to have official support. Continue reading “MediaTek’s partners deliver OpenCL on their phones”

Adapteva Parallella board

A Parallella board that has been delivered.
The 16-core Parallella board

Need a Parallella programmer? Hire us!

Adapteva is the creator of the Parallella developer board – a board with an OpenCL-programmable grid processor plus FPGA. The $99 board consists of:


  • A board with a 16-core Epiphany-III
  • Full Epiphany SDK
  • OpenCL compiler based on libcoprthr.
  • A few code-samples (total number unknown).


A few of the features:


  • 32 GFLOPS peak performance
  • 2 Watt TDP (chip only, whole board unknown)
  • USB plug-and-play ready
  • Remote access support


Information on their OpenCL compiler stack can be found here. Latest news can be received via Twitter/ParallellaBoard, or via their community forums.

The 64-core Epiphany-IV chip has also been available since Q3 2014. More info later.

Double the performance on AMD Catalyst by tweaking subgroup operations

With standard OpenCL 2.0, AMD’s hardware was used for less than half its potential in the case of scan operations.

OpenCL 2.0 added several new built-in functions that operate on a work-group level. These include functions that work within sub-groups (also known as warps or wavefronts). The work-group functions perform basic parallel patterns for whole work-groups or sub-groups.

The most important ones are the reduce and scan operations. Those patterns have been used in a lot of OpenCL software and can now be implemented in a more straightforward way, as sketched below. The promise to developers was that vendors can now provide better performance using no or very little local memory. However, the promised performance wasn’t there from the beginning.
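As an illustration (a minimal sketch, not the actual customer kernel): an inclusive scan per work-group now takes a single built-in call, where OpenCL 1.x required a hand-written local-memory implementation.

__kernel void scan_wg(__global const int *in, __global int *out) {
    size_t i = get_global_id(0);
    // One call, no local memory in user code; requires -cl-std=CL2.0.
    out[i] = work_group_scan_inclusive_add(in[i]);
}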

Recently, at StreamHPC, we worked on improving the performance of certain OpenCL kernels running specifically on AMD GPUs, where we needed OpenGL interop and thus chose the Catalyst drivers. It turned out that the work-group and sub-group functions did not give the expected performance on either Windows or Linux. Continue reading “Double the performance on AMD Catalyst by tweaking subgroup operations”

Engineering GPGPU into existing software

At the Thalesian talk about OpenCL I gave in London, it was quite hard to find a way to talk about OpenCL for a very diverse public (without falling back to listing code samples for 50 minutes); some knew just about everything about HPC and others had only heard of CUDA and/or OpenCL. One of the subjects I chose to talk about was how to integrate OpenCL (or GPGPU in general) into existing software. The reason is that we have all built nice, cute little programs which were super-fast, but it’s another story when it must be integrated into some enterprise-level software.

Readiness

The most important step is making your software ready. Software engineering can be very hectic; managing this in a nice manner (e.g. PRINCE2) just doesn’t fit in a deadline-mined schedule. We all know it costs less time and money when looking at the total picture, but time is just against us.

Let’s exaggerate. New ideas, new updates of algorithms, new tactics and methods arrive at the wrong moment, Murphy-wise. It has to be done yesterday, so testing is only allowed if the code goes into the production code too. Programmers just have to understand the cost of delay, but luckily somebody comes to the rescue and says: “It is my responsibility”. And after a year of stress your software is the best in the company and gets labelled as a “platform”, meaning that your software is chosen to include all the small ideas and scripts your colleagues have come up with “which do almost the same as your software, only a little different”. This will turn the platform into something unmanageable. That is a different kind of software acceptance!

Continue reading “Engineering GPGPU into existing software”

Call for Papers, Presentations, Workshops and Posters for IWOCL in Stanford

The IWOCL 2015 call for OpenCL Papers is now open and is looking for submissions from industry and academia relating to the use of OpenCL. Submissions may refer to completed projects or those currently in progress and are invited in the form of:

  • Research Papers
  • Technical Presentations
  • Workshops and Tutorials
  • Posters

Examples of sessions from 2014 can be found here.

Deadlines at a Glance

Call for submissions OPENS: Wednesday 19th November, 2014
Call for submissions CLOSES: Saturday 14th February, 2015 (23:59 AOE)
Notifications: Within 4 weeks of the final closing date

Selection Criteria

The IWOCL Technical Committee will select submissions based on the following criteria:

  • Concept of the submission and its relevance and timeliness
  • Technical Depth
  • Clarity of the submission; clearly conveying what your presentation will cover
  • Research findings and results of your work
  • Your credentials and expertise in the subject matter

Unpublished Technical Papers

We solicit the submission of unpublished technical papers detailing original research related to OpenCL. All topics related to OpenCL are of interest, including OpenCL applications from any domain (e.g., scientific computing, video games, computer graphics, multimedia, information retrieval, optimization, text processing, data mining, finance, signal and image processing and numerical solvers), OpenCL performance analysis and modeling, OpenCL performance and correctness tools and proposed OpenCL extensions. IWOCL will publish formal proceedings of the accepted papers in The ACM International Conference Series. Please Submit an Abstract which should be between 1 and 4 pages long.

Technical Presentations

We solicit the submission of technical presentations detailing the innovative use of OpenCL. All topics related to OpenCL are of interest, including but not limited to applications, software tools, programming methods, extensions, performance analysis and verification. Please Submit an Abstract which should not exceed 4 pages.  The accepted presentations will be published in the online workshop proceedings.

Workshops & Tutorials

IWOCL includes a day of tutorials that provide OpenCL users an opportunity to spend more time exploring a specific OpenCL topic.  Tutorial submissions should assume working knowledge of OpenCL by the attendees and can for example cover OpenCL itself, any of the related APIs such as SPIR and SYCL, the use of OpenCL libraries or parallel computing techniques in general using OpenCL. Please Submit an Abstract which should not exceed 4 pages. Please include  the preferred length of the tutorial or workshop (e.g. 2, 3 or 4 hours).

Posters

To encourage discussion of the latest developments in the OpenCL community, there will be a poster session running in parallel to the main sessions and open during the breaks and lunch sessions.  The abstracts of the accepted posters will be published in the form of short communications in the workshop proceedings, provided that at least one of the authors has registered for the workshop. Please Submit an Abstract which should not exceed 2 pages.

Submit your abstract today

Go to Easychair, log in or register, and click on “New Submission”. Deadline is 14 February.

Intel promotes OpenCL as THE heterogeneous compute solution

At Intel they have CPUs (Xeon, Ivy Bridge), GPUs (Iris) and accelerators (Xeon Phi). OpenCL enables each processor to be used to the fullest, and they now promote it as such. Watch the video below and see their view on why OpenCL makes a difference for Intel’s customers.

This is important, because until recently Intel was pushing OpenMP and their proprietary solutions more. I think it has something to do with the specialised processors that can be programmed with OpenCL, such as DSPs and FPGAs. Intel has always made generic processors that solve most problems best. Customers of OpenCL happen to be the ones that could not be served with generic processors and preferred FPGAs and DSPs before they tried GPUs. By showing that Intel can do OpenCL, they show they are a trustworthy partner to handle such problems in a few years, when the current problems can be handled by Intel processors.

Of course the Xeon Phi is also a good reason. The latest drivers have shown a huge improvement in performance, and that has increased Intel’s confidence in OpenCL for sure.

At StreamHPC we are very happy that Intel now openly promotes OpenCL and invests in it – this will increase trust in the programming language.

A small side-note: the differences between the Windows drivers and Linux drivers are somewhat vague. Under Linux the CPU is visible, but not officially supported. This makes developing multi-processor software not as straightforward as discussed in the video. Probably this will become more extensive in the future, as Intel only officially supports OpenCL on a processor when it’s very stable.