OpenCL.org internship/externship

Our internship is exactly what the description says: it’s a rather complex homepage which should look good on your CV (if you manage to build it).

Want to help build an important website? OpenCL.org’s components have been designed and partly built, but a lot of work still needs to be done. We’re seeking an intern (or “extern” when not in Amsterdam) who can help us build the site. This internship is not about GPUs!

To complete the tasks, the following is required:

  • Technical expertise:
    • HTML5, CSS
    • PHP
    • Javascript
    • jQuery
    • Node.js
    • Mediawiki
    • XSLT
  • Can-do mentality
  • Able to plan own work
  • Good communication-skills
  • Available for 3 to 6 months

We don’t expect you to know all these tools, so we will guide you in learning new tools and techniques. Write an “email of interest” to info@streamhpc.com, telling us what you can do and what your objectives for an internship would be.

We’re looking forward to seeing your letter!

PDFs of Monday 29 August

This is the first PDF-Monday. It started because I use Mondays to read up on what’s happening around OpenCL, and I like to share that with you. It is a selection of what I find (somewhat) interesting – don’t hesitate to contact me about anything you want to know about accelerated software.

Parallel Programming Models for Real-Time Graphics. A presentation by Aaron Lefohn of Intel, on why a mix of data-, task- and pipeline-parallel programming works better on hybrid computing (specifically Intel processors with the latest AVX and SSE extensions) than on GPGPU.

The Practical Reality of Heterogeneous Super Computing. A presentation by Rob Farber of NVidia on why discrete GPUs have a great future even if heterogeneous processors hit the market. Nice insights, as you can expect from the author of the latest CUDA-book.

Scalable Simulation of 3D Wave Propagation in Semi-Infinite Domains Using the Finite Difference Method (Thales Luis Rodrigues Sabino, Marcelo Zamith, Diego Brandão, Anselmo Montenegro, Esteban Clua, Maurício Kischinhevsky, Regina C.P. Leal-Toledo, Otton T. Silveira Filho, André Bulcão). A GPU-based cluster environment for the development of scalable solvers for a 3D wave-propagation problem with finite difference methods. Focuses on scattering sound-waves for finding oil-fields.

Parallel Programming Concepts – GPU Computing (Frank Feinbube). A nice introduction to CUDA and OpenCL. They missed task-parallel programming on hybrid systems with OpenCL though.

Proposal for High Data Rate Processing and Analysis Initiative (HDRI). Interesting if you want to see a physics project where they have not yet decided between GPGPU and a CPU-cluster.

Physis: An Implicitly Parallel Programming Model for Stencil Computations on Large-Scale GPU-Accelerated Supercomputers (Naoya Maruyama, Tatsuo Nomura, Kento Sato and Satoshi Matsuoka). A collection of macros for GPGPU, tested on TSUBAME2.

AMD gDEBugger 6.2 for Linux

The printf-function in kernels isn’t the solution to everything, hence the profilers and debuggers specially tailored to GPU-programming. On Windows there is a lot of choice, but mostly only if you have a paid version of Visual Studio. On Linux you have GDB, but that program is not really user-friendly for the GUI-lovers.
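As a reminder of what that looks like: kernel-side printf has been core since OpenCL 1.2 (before that it was available via vendor extensions such as cl_amd_printf). A minimal sketch with a made-up kernel:

__kernel void debug_example(__global const float *in) {
   size_t gid = get_global_id(0);
   // print from one work-item only, to keep the output readable
   if (gid == 0)
      printf("first element: %f\n", in[0]);
}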

For AMD, gDEBugger is now available for Linux again. Version 5.8 by Gremedy worked on Linux, but after AMD bought the company, version 6 became Windows-only. A few weeks ago, 10 months after 6.0, Linux binaries returned with version 6.2. It supports OpenCL 1.2, OpenGL 3.2 and quite some extensions. As only AMD is supported, more later on debugging OpenCL-applications on NVidia and Intel.

Installation is quite straightforward. For creating a menu-item, you’ll find a useful image in /opt/gDEBugger6.2.xxx/tutorial/images/.

Continue reading “AMD gDEBugger 6.2 for Linux”

OpenCL at SC15 – the booths to go to

This year we’re unfortunately not at SuperComputing 2015, for reasons you will hear later. But we haven’t forgotten about the people going there to find their share of OpenCL. Below is a list of companies having a booth at SC15, assembled by the guys of IWOCL and completed by us with some more background information.

Khronos

The first place to go is booth #285: meet Khronos to hear where at SC15 you can see how OpenCL has risen over the years. More info here. Say hi from the StreamHPC team!

OpenCL on FPGAs

Altera | Booth: #462. Expected to have many demos on OpenCL. See their program here. They have brought several partners around the floor, all expecting to have OpenCL demos:

  • Reflex | Booth: #3115.
  • BittWare | Booth #3010.
  • Nallatech | Booth #1639.
  • Gidel | Booth #1937.

Xilinx | Booth: #381. Expected to show their latest advancements on OpenCL. See their program here.

Microsoft | Booth: #1319. Microsoft Bing is accelerated using Altera and OpenCL. Ask them for some great technical details.

ICHEC | Booth #2822. The Irish HPC centre works together with Xilinx using OpenCL.

Embedded OpenCL

ARM | Booth: #2015. Big on 64 bit processors with several partners on the floor. Interesting to ask them about the OpenCL-driver for the CPU and their latest MALI performance.

Huawei Enterprise | #173. Recently proudly showed the world their OpenCL capable camera-phones, using ARM MALI.

HPC OpenCL

Below are the three companies that promise at least 1 TFLOPS DP per co-processor.

Intel | Booth: #1333/1533. Where they used to speak about OpenMP and forget about OpenCL, Altera has brought them back. Maybe they’ll share some plans about Xeon+FPGA, or OpenCL support for the new XeonPhi.

AMD | Booth: #727. HBM, HSA, Green500, HPC APU, 32GB GPUs and 2.2 TFLOPS performance – enough to talk about with them. Also lots of OpenCL love.

NVidia | Booth: #1021. Every year they have been quite funny when asked about why OpenCL is badly supported. Please do ask them this question again! Funniest answer wins something from us – to be decided.

Others

You’ll find OpenCL in many other places.

ArrayFire | Booth #2229. Their library has an OpenCL backend.

IBM | Booth: #522. Now that Altera has joined Intel, IBM’s OpenPower has been left with NVidia for accelerators. OpenCL could revive the initiative.

NEC | Booth: #313. The NEC group has accelerated PostgreSQL with OpenCL.

Send your photos and news!

Help us complete this post with news and photos. We’re sorry not to be there this year, so we need your help to make the OpenCL party complete. You can send them via email, Twitter or the comments below. Thanks in advance!

OpenCL Potentials: Investment-industry

This is the second in the series “OpenCL potentials“. I chose this industry because it is the finest example of a market where you are always late, even if you were first – so it always must be faster if you want to make better analyses. Before I started StreamHPC I worked for an investment-company, and one of the things I did was reverse-engineering a few megabytes of code with the primary purpose of updating the documentation. I then made a proof-of-concept showing that the data-processing could be accelerated by a factor of 250-300 using Java-tricks only and no GPGPU. That was the moment I started to understand that real-time data-computation was certainly possible, and also that IO is the next bottle-neck after computational power. Though I am more interested in other types of research, I do have this background and therefore try to give an overview of this sector and why it matters.
Continue reading “OpenCL Potentials: Investment-industry”

Assessment of existing code-base’s quality

You found that the main computation takes over 90% of the processing time, or you found the framework to be slow in general. We got in contact and discussed speeding up your software, after which we shared this assessment with you. This assessment should give you insight into whether the software-project is ready to be shared with Stream HPC.

The larger the code-base, the more important its code-quality

The first step is to prepare the code for porting or optimization. As it’s not always easy to know what to do, we’ve defined 3 levels of code quality. The higher the quality-level, the fewer obstacles for the project, the lower the costs, the less communication required and the fewer frustrations.

Preparing a project for porting / optimizing

The sections below discuss the 3 levels a project can be at. The goal is that you do a self-assessment and write down answers for each question with details, not only the final answer.

The code needs to meet all levels marked in red: all of level 1 and the high level of level 2. It does not matter if the existing code is written in Matlab, Python, C, C++, Assembly, OpenCL, CUDA or any other language.

The action points of all levels need to be done. When a project is not ready yet, we can assist in improving code-quality, but assume a lot has to be done by your team. This will be a separate (pre-)project, and the full estimation for the porting/optimization can only be done after that.

If it’s not possible to level up the software, or no source files are available, it will be handled as a black-box project or R&D project. Do know that such projects can never be done fixed-price and are always unique. The generic part of the process is described in this blog – we are experienced in doing such focused R&D projects.

Level 1: Understandability

Goal: Can the software be understood without help from the main developers?

High level

  • Are the algorithms explained in e.g. scientific papers?
  • Do alternate implementations exist, e.g. in Python or Matlab?
  • Optional: is there a presentation/overview on the algorithm?

Mid level

  • Is there a software design document?
  • Is test-data provided that can be used to run the code?

Low level

  • Are all functions documented?
  • Is it clear what each (part of the) function is doing?
  • Are all variables documented?
  • Is it clear what each variable means?

Action points

  • A good communication-plan to get questions answered quickly. This includes a direct contact and regular calls.
  • Walk the code and improve/update the existing in-code documentation. Often it’s not looked at since it was written.

Level 2: Testability

Goal: Are there few bugs and can new bugs be easily detected?

High level

  • Is there a golden-standard implementation, such as a proof-of-concept or the existing code? Is it available to us?
  • Are the outputs deterministic? Or can they be made deterministic quickly?
  • Is there a good understanding of where the algorithm is less stable or unstable, and can this be explained?
  • Is there clarity on the required maximum quantitative errors that would define the correctness of output? Are there high level test-cases for all of these?
  • Is there clarity on required maximum qualitative errors that would define the correctness of output? Is the collection of examples and counter-examples large enough to define the error in full?
  • Can the whole library be tested using the sample input and output?
  • Is there clarity on the required precision? When is an answer correct?

Mid Level

  • Does the CI contain static code analysis?
  • Are there test-cases or automated tools for finding qualitative errors?

Low level

  • Are the compute-intensive functions covered by functional tests?
  • Are the most important functions covered by functional tests?
  • Are the other functions covered by functional tests?

Action points

  • Making the code deterministic by temporarily removing the random number generator.
  • Define what is correct output in detail.
  • Complete the test cases for the different types of errors.
  • Creating several sets of data-in and their correct data-out. These will be used for acceptance of the ported code, together with the maximum errors allowed.
  • Decide which sub-results need to be compared.
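To illustrate the acceptance-testing points above: a comparison harness can be as simple as the C sketch below (the function name and error-handling are ours, as an example):

#include <math.h>
#include <stdio.h>

// Compares ported output against the golden reference data-out,
// within the agreed maximum absolute error.
int outputs_match(const float *reference, const float *ported,
                  int n, float max_error) {
   for (int i = 0; i < n; i++) {
      if (fabsf(reference[i] - ported[i]) > max_error) {
         printf("mismatch at %d: %f vs %f\n", i, reference[i], ported[i]);
         return 0;
      }
   }
   return 1;
}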

Level 3: Quality

Goal: Is the software easy to maintain and extend?

High level

  • Is there build-automation, like CMake or Meson?
  • Is the code multi-OS?
  • Are tests run with every commit, or at least daily/nightly?

Mid level

  • Are functions defined to only do one thing and do it well?
  • Has nobody on the team labelled the code as “spaghetti code”?
  • Are function names self-explanatory?
  • Are variable-names self-explanatory?

Low level

  • Is there no duplicated code?
  • Is there no code that could be made much less complex?
  • Are there no functions or methods longer than 100 lines, excluding documentation?
  • Are there no large classes?
  • Are there no functions or methods with far more than 7 parameters?
  • Is there only a little commented-out code?
  • Is there only a little dead code? This is code that is not being called from anywhere.
  • Is there no code that should be rewritten?
  • Are there no variables that are being reused?
  • Is there no significant dependence on global state?

Action points

  • Prepare the (new) GPU-code for continuous integration.
  • Document how the ported code maps to the existing code.

Costs Assessment

Code at level 0 can take 10 to 20 times more time than code at level 1. How much exactly is difficult to say, and that’s exactly the reason we require a minimal level. The bitter pill is that the costs of not cleaning up the code are often even higher, due to increased hardware costs, increased maintenance costs and lack of innovation.

There is only one reason to keep code quality under level 1: when it’s going to be replaced within a month.

When level 1 is mostly done, each missing item can add 20% to 40% of extra costs. A good example is not having good test-cases, no CPU-code available, or not even an executable. This adds costs in both the implementation phase and the acceptance phase. Making a new CPU-implementation first is often cheaper.

From the described minimal level (level 1 in full, plus the high level of level 2) costs are more predictable and lower. When getting a quote, these items can be requested to be mentioned, and you can choose to do them yourself or let us do them.

Contact

Just discuss your goals with us after you have done the assessment. We can guide you in prioritizing how to make your code ready for porting.

Email info@streamhpc.com or check the other contact options on the contact page here. Then we’ll schedule a call and discuss what you need.

If you find this self-assessment useful, know it took us a long time to improve this document and a lot of experience is hidden in it. But as we find quality software very important, we’re releasing this list under CC-BY-NC-ND: it cannot be altered or used commercially, and it must have a clear reference to Stream HPC as the authors. In other words: it must be clear that you did not do the research or writing of this assessment yourself.

If you have feedback or suggestions, we’d really like to hear from you!

AMD GPUs & CPUs


Need a programmer for OpenCL on AMD FirePro, Radeon or APU? Hire us!


AMD has support for all their recent GPUs and CPUs, and has good performance on products starting from 2010/2011:

AMD does not provide a standard SDK which contains both hardware and software, as their hardware is available in many computer-shops.

SDK

The OpenCL SDK (software) needs to be downloaded in several steps:


CodeXL replaces the following software in the AMD APP software family:

These are still available for download.

Training

There is (free) training material available:


Other AMD software for OpenCL

The APP Math Libraries contain FFT and BLAS functions optimised for AMD GPUs.

OpenCL-in-Java can be done using Aparapi.

IWOCL 2017 – all the talks

An overview of all the tutorials and talks for easy reading.

You can also download the PDF.

Heterogeneous Computing Using Modern C++ with OpenCL Devices – Rod Burns and Ruyman Reyes (Codeplay)

This hands-on session will provide an opportunity to get experience with SYCL using ComputeCpp™ Community Edition, a free to use implementation of the SYCL 1.2 standard. Attendees will be shown how to set up ComputeCpp and use it to write their own SYCL code to run on supported GPUs and CPUs.

SYCL is already able to dispatch to heterogeneous devices and it implements C++17 ParallelSTL, augmenting it with the ability to dispatch to GPUs in addition to CPUs. This tutorial will demonstrate how to write parallel SYCL code and how to use the Khronos Group’s experimental Parallel STL implementation. The course outline is as follows:

  • Start with a basic SYCL program that shows how to submit work to queues using a single task and a stream-like object, comparing CPU, SYCL and OpenCL versions
  • Demonstrate how to access data across host and GPUs using buffers and accessors, the importance of life-time, and basic parallel constructs

Attendees are expected to have programming experience with C++ and a laptop either running Linux or having a VM manager installed such as VirtualBox. The required software will be provided on USB-sticks. This course is suitable for beginners, but is focused on intermediate to advanced parallel programming using C++.

Harnessing the Power of FPGAs with the Intel FPGA SDK for OpenCL- Byron Sinclair, Andrew Ling and Genady Paikin (Intel)

In this tutorial, we will introduce you to the reconfigurable hardware architecture and programming of Field Programmable Gate Arrays (FPGAs).

You will learn why FPGAs have become so popular in recent years, and understand the many advantages of using FPGAs in your HPC application. In particular, we will cover architectural features of FPGAs that make them well suited to many complex operations, including matrix multiplications and convolutions. In addition, we will introduce you to programming FPGAs using the Intel® FPGA SDK for OpenCL, and how specific OpenCL coding techniques can lead to efficient circuits implemented on the FPGA.

Finally, we will go over several case studies where FPGAs have shown very competitive performance when programmed using OpenCL, including convolutional neural nets, FFTs, and astronomy de-dispersion algorithms.

Unlock Intel GPUs for High Performance Compute, Media and Computer Vision Capabilities with Intel OpenCL Extensions – Jeff Mcallister, Biju George, Adam Herr and Ben Ashbaugh (Intel)

The keys to unlock the full performance potential of Intel GPUs for emerging workloads in general compute, media, computer vision, and machine learning are in the rich suite of Intel OpenCL extensions. These give developers direct access to unique Intel hardware capabilities, which until now have been difficult to master.
This tutorial builds step by step with multiple examples, including:

  • How to write high performance general compute applications based on the core concept of OpenCL subgroups.
  • How to use additional subgroup operations described in the Intel subgroups and media block read/write extensions.
  • Then using the framework of subgroups, we explain the device-side motion estimation extension which leverages the unique Intel GPU media sampler to accelerate motion estimation operations from OpenCL kernels.
  • Finally, we explain the Video Enhancement (VEBOX) extension, an OpenCL host-level API extension that leverages a powerful media fixed-function unit to accelerate many frame-level video enhancement operations.

Faster, smarter computer vision with AI and OpenCL – Uri Levy and Jeffrey Mcallister (Intel)

Learn how to use Intel machine learning and computer vision tools to get from concept to market faster for machine learning applications based on OpenCL and OpenVX. Build two example scenarios: autonomous driving with FPGA inference and a smart camera app using Intel Graphics inference. This presentation will show how a unified set of tools can reduce the complexity of developing heterogeneous machine learning apps – from training a model with input images, to creating a custom classifier, to building an optimized traditional computer vision pipeline around the classifier to create a full computer vision application.

GPGPU Acceleration using OpenCL for a Spotlight SAR Simulator – Eric Balster, Jon Skeans and David Fan (University of Dayton) Marc Hoffman (US Air Force Research Laboratory)

In this paper, OpenCL is used to target a general purpose graphics processing unit (GPGPU) for acceleration of two modules used in a synthetic aperture radar (SAR) simulator. Two of the most computationally complex modules, the Generate Return and Back Projection modules, are targeted to an AMD FirePro M5100 GPGPU. The resulting speedup is 2.5X over multi-threaded C++ implementations of those algorithms running on an 8-core Intel i7 2.8GHz processor, 5X over single-threaded C++ implementations, and 24X over native MATLAB implementations, on average.

Near Real-Time Risk Simulation of Complex Portfolios on Heterogeneous Computing Systems with OpenCL – Javier Alejandro Varela and Norbert Wehn (University of Kaiserslautern)

In this work, we exploit OpenCL to efficiently map the nested simulation of complex portfolios with multiple algorithms on heterogeneous computing systems. Code portability and customizations allow us to profile the kernels on different accelerating platforms, such as CPU, Intel’s Xeon Phi and GPU. The combination of OpenCL, a new bit-accurate algorithmic optimization and the extension of an existing numerical interpolation scheme allows us to achieve 1000x speedup compared to the state-of-the-art approach. Our system design minimizes costly host-device transfers and global memory, enabling complex portfolios to be easily scaled.

A Performance and Energy Evaluation of OpenCL-accelerated Molecular Docking – Leonardo Solis Vasquez and Andreas Koch (Technische Universität Darmstadt)

This work presents an OpenCL implementation of AutoDock, and a corresponding performance evaluation on two different platforms based on multi-core CPU and GPU accelerators. It shows that OpenCL allows highly efficient docking simulations, achieving speedups of ∼4x and ∼56x over the original serial AutoDock version, as well as energy-efficiency gains of ∼2x and ∼6x, respectively. To the best of our knowledge, this work is the first one also considering the energy efficiency of molecular docking programs.

Assessing the feasibility of OpenCL CPU implementations for agent-based simulations – Nuno Fachada and Agostinho Rosa (Instituto Superior Técnico, Portugal)

In this paper we evaluate the feasibility of using CPU-oriented OpenCL for high-performance simulations of agent-based models. We compare a CPU-oriented OpenCL implementation of a reference ABM against a parallel Java version of the same model. We show that there are considerable gains in using CPU-based OpenCL for developing and implementing ABMs, with speedups up to 10x over the parallel Java version on a 10-core hyper-threaded CPU.

Enabling FPGAs as a True Device in the OpenCL Standard – Vincent Mirian and Paul Chow (University Of Toronto)

As FPGA capacities continue to increase, the ability to partition and partially reconfigure the FPGA will become even more desirable. The fundamental issue is how FPGAs are currently viewed as devices in the OpenCL model. In this paper, we propose a small change to the OpenCL definition of a device that unlocks the full potential of FPGAs to the programmer.

Applying Models of Computation to OpenCL Pipes for FPGA Computing – Nachiket Kapre and Hiren Patel (University of Waterloo)

We propose imposing a communication discipline inspired by models of computation (e.g. Ptolemy) such as SDF (synchronous dataflow), bulk synchronous (BSP), or Discrete Event (DE). These models offer a restricted subset of communication patterns that enable implementation tradeoffs and deliver performance and resource guarantees. This is useful for OpenCL developers operating within the constraints of the FPGA device. We hope to facilitate a preliminary analysis and evaluation of supporting these patterns in OpenCL and quantifying associated FPGA implementation costs.

Accelerating Applications at Cloud Scale using FPGAs – Sarah Siripoke, Fernando Martinez Vallina and Spenser Gilliland (Xilinx)

The acceptance and success of cloud computing has given application developers access to computing and new customers at a scale never seen before. The inherent ability of an FPGA to reconfigure and be workload optimized is a great advantage given the fast-moving needs of cloud computing applications. In this talk we will discuss how users can develop, accelerate and deploy accelerated applications in the cloud at scale. You will learn how to get started on a turn-key OpenCL development environment in the cloud using Xilinx FPGAs.

Creating High Performance Applications with Intel’s FPGA OpenCL SDK – Andrew Ling, Utku Aydonat, Davor Capalija, Shane O’Connell and Gordon Chiu (Intel)

After decades of research, High-Level Synthesis has finally caught on as a mainstream design technique for FPGAs. However, achieving performance results that are comparable to designing at a hardware description level still remains a challenge. In this talk, we illustrate how we achieve world class performance results on HPC applications by using OpenCL. Specifically, we show how we achieve 1Tflop of performance on a matrix multiply and over 1.3Tflops on a CNN application, run on Intel’s 20nm Arria 10 FPGA device. Finally, we will describe spatial coding techniques that lead to efficient structures, such as systolic-arrays, to ensure that the FPGA runs efficiently.

Symphony – Task Scheduling and Memory Management in Heterogeneous Computing – Amit Jindal and Wenjia Ruan (Qualcomm Technologies)

Task scheduling and memory management are challenges that make Heterogeneous Computing difficult for the masses. There are several programming models and tools that exist targeting partitioning of workload and accessibility of data between CPU and GPU. We have developed and deployed Symphony SDK – a framework that makes workload partitioning, scheduling and memory management ‘simple’ for developers. In this talk, we will introduce Symphony architecture, elaborate how existing OpenCL kernels can be reused with heterogeneous task synchronization, task scheduling, and memory management capabilities of Symphony. We will also share real-world cases where Symphony has provided 2x-6x performance speed-ups.

CUDA-on-CL: A compiler and runtime for running modern CUDA c++11 applications on OpenCL 1.2 devices – Hugh Perkins (ASAPP)

Cuda-on-cl addresses the problem of creating and maintaining OpenCL forks by leaving the reference implementation entirely in NVIDIA CUDA, and writing both a compiler and a runtime component, so that any CUDA C++11 application can in theory be compiled and run directly on any OpenCL 1.2 device. We use the Tensorflow framework as a case-study, and demonstrate the ability to run Tensorflow and Eigen kernels directly, with no modification to the original CUDA source-code. Performance studies are also undertaken, and show that the cuda-on-cl program runs at about 25% of the original CUDA-compiled version.

OpenCL in Scientific High Performance Computing—The Good, the Bad, and the Ugly – Matthias Noack (Zuse Institute Berlin)

We present experiences with utilising OpenCL alongside C++, MPI, and CMake in two real-world scientific codes. Our targets are a Cray XC40 supercomputer with multi- and many-core (Xeon Phi) CPUs, as well as multiple smaller systems with Nvidia and AMD GPUs. We shed light on practical issues arising in such a scenario, like the interaction between OpenCL and MPI, discuss solutions, and point out current limitations of OpenCL in the domain of scientific HPC from an application developer’s and user’s point of view.

Accelerated Machine Learning Using TensorFlow and SYCL on OpenCL Devices – Andrew Richards, Mehdi Goli and Luke Iwanski (Codeplay)

Codeplay has been working with Google to add SYCL back-end support in TensorFlow, one of the most popular machine learning frameworks, enabling developers to use OpenCL devices with their machine learning applications. SYCL provides an abstraction layer that simplifies parallel development, giving developers access to the computing power of OpenCL devices and reducing the amount of code required. Andrew Richards will talk about how machine learning applications can harness the power of OpenCL using open standards and how, by using SYCL, TensorFlow can be extended to include customized operations running on OpenCL devices.

Analyzing and improving performance portability of OpenCL applications via auto-tuning – James Price and Simon McIntosh-Smith (University of Bristol)

In this talk, we present an approach for analyzing performance portability that exploits the black-box nature of automatic performance tuning techniques. We demonstrate this approach across a diverse range of GPU and CPU architectures for two simple OpenCL applications. We then discuss the potential for auto-tuning to aid the generation of performance portable OpenCL kernels by incorporating multi-objective optimization techniques into the tuning process.

Wavefront Parallel Processing on GPUs with an Application to Video Encoding Algorithms – Biju George and Ben Ashbaugh (Intel)

In this presentation we focus on the application of the wavefront pattern to design efficient GPGPU implementations of video encoding algorithms using OpenCL kernels. We present our experiences in implementing and evaluating four solutions of WPP for inter and intra estimation for AVC on GPUs. We explain the reasoning behind each solution and present the results of our analysis.

Challenges and Opportunities in Native GPU Debugging with OpenCL – Uri Levy (Intel)

In this technical session we’ll present the open architectural design of the debugger and how it fits into the OpenCL JIT compilation flow and the underlying compute technology of the HW, with focus on Intel processor graphics. We’ll demonstrate a show case on how to natively work with the debugger to solve functional bugs, as well as low-level debugging techniques on the SIMD thread level, which help to solve complex issues such as misaligned or out-of-range accesses to local/global memory, stack overflows, illegal instructions, etc. Finally, we’ll cover the challenges in debugging

Modeling Explicit SIMD Programming with Subgroup Functions – Biju George and Ben Ashbaugh (Intel)

In this presentation, based on our experience in developing publicly released vendor extensions based on subgroups, we explain the advantages of the “explicit SIMD” programming paradigm using OpenCL subgroup and how the subgroups framework can be leveraged to: (1) Model features for performance in OpenCL that are commonly available in programming languages or interfaces based on an “explicit SIMD” programming paradigm such as the AVX intrinsics supported in GCC; and to (2) Model features to expose functionality available in GPU accelerator units that are more conveniently and efficiently exposed using a block API.

An OpenCL-on-FPGAs presentation in a bar

What do you do when you want to explain OpenCL and FPGAs and OpenCL-on-FPGAs to a beer-drinking crowd in just 15 minutes? Well, you simply can’t go deep into the matter. On a Thursday evening, 5 November 2015, I was standing on a chair for a beer-loving group of Hackers and Founders with my laser-powered presenter, trying not to lose everybody. It was not the first time I stood on that particular chair – some years ago I presented about OpenCL-on-GPUs.

Below you find the full presentation – feel free to use and change the slides for yourself (PDF here).

Do you want us to present OpenCL, accelerators or performance engineering in a talk tailored for your audience? Just give us a call.

X86-workstation buying guide for OpenCL developers, Q1 2013

Curved iMac has your back…
Nuno Teixeira designed a large curved monitor in 2008 and assumed it would never be made. For a “few” thousand dollars, NEC now offers one to you. Samsung and LG have also announced several new curved TVs at CES 2013 (with HDMI-port). We only need a workstation to go with it, which is where this blog-article might come in handy.

Important: this article was written before Intel “Haswell” and AMD “Richland” architectures came out.

So you want to start developing for OpenCL? When you focus on developing OpenCL for X86, you have these three options: CPUs, GPUs and CPUs with an embedded GPU. This article is for you and represents the current state of hardware – if you want the best hardware for your specific algorithm, the information below is probably not sufficient.

In 2013 we focus on 4 groups: servers/cloud (FirePro, Tesla, XeonPhi), workstations (discussed here), low-power devices (SoCs) and special accelerators (FPGAs and DSPs). This article does not discuss high-end accelerators of a few thousand Euro, which are laid out here.

Before reading on, you need to set the goal for your workstation.

  • If you want to learn the basics of OpenCL-programming, first check if your current machine has OpenCL-support (the sketch below this list can help).
  • If you need more processing power, be sure you select the right hardware for the job. Don’t buy the most expensive hardware (FirePro, Tesla or XeonPhi), but take your time to find out which hardware supports your algorithms best. Feel free to ask us.
  • If you want to make sure your software works on various types of accelerators, you can choose between:
    • swapping PCIe-cards – disadvantage is the driver-hassle and time-consumption.
    • more accelerators in one machine – disadvantage is that only GPU 1 can do OpenGL/DirectX.
    • identical machines with different accelerators – disadvantage is the price.
  • If you want to focus on multi-GPU development, you need:
    • either enough power-supply and a motherboard that supports many lanes,
    • or a videocard with two GPUs.
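Checking for OpenCL-support takes only a quick platform query. Below is a minimal sketch (link against the OpenCL library, e.g. with -lOpenCL):

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
   cl_platform_id platforms[8];
   cl_uint num_platforms = 0;
   if (clGetPlatformIDs(8, platforms, &num_platforms) != CL_SUCCESS
       || num_platforms == 0) {
      printf("No OpenCL platforms found.\n");
      return 1;
   }
   for (cl_uint i = 0; i < num_platforms; i++) {
      char name[256];
      cl_uint num_devices = 0;
      clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
      clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
      printf("Platform: %s (%u devices)\n", name, num_devices);
   }
   return 0;
}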

The goal of this article is to help you buy a good machine for OpenCL-development. Prices are as of January 2013. If you think I make the wrong suggestions, please give feedback via the comments.

My contacts at various companies can tell: I want to stay independent no matter what. No deals have been made nor was there any outside influence, except the friendly people of the local computer shops. I was surprised that I ended up suggesting so much AMD hardware – so much that I felt quite uncomfortable with it. I finally decided to keep to my first conclusions and leave the comments completely open.

Continue reading “X86-workstation buying guide for OpenCL developers, Q1 2013”

Google blocked OpenCL on Nexus with Android 4.3

Important: this is only for Google-branded Nexus phones – other brands are free to do what they want, and they most probably will.

Also important: this doesn’t mean that OpenCL on Android devices is over, but there is a bump in the road now that Google tries to lock in customers to their own APIs.

The big idea behind OpenCL is that higher-level languages and libraries can be built on top of it. This is exactly what was done under Android: RenderScript Compute (a higher-level language) was implemented using OpenCL for ARM Mali GPUs.

Having OpenCL drivers on Android has several advantages, such as OpenCL being directly usable on Android and there being room for other high-level languages that have OpenCL as a back-end. Especially the latter is probably what made Google decide to cripple the OpenCL-drivers.

Google seems to be afraid of competition, and that’s a shame, as competition is the key factor that drives innovation. The OpenCL community is not the only one complaining about Google’s intentions concerning Android. Read page 3 of that article to understand how Google is controlling handset-vendors and chip-makers.

Google’s statement

In February OpenCL drivers were discovered on two Nexus tablets using a MALI T604 GPU. Around the same time there was one public answer from Google employee Tim Murray (twitter) on why Google did not want to choose OpenCL: Continue reading “Google blocked OpenCL on Nexus with Android 4.3”

OpenCL integer rounding in C

Square pant rounding can simply be implemented with “return (NAN);“.

Getting about the same code in C and OpenCL has lots of advantages when maximum optimisations and vectors are not needed. One thing I bumped into myself was that rounding in C++ is different, so I decided to implement the OpenCL-functions for rounding in C.

The OpenCL-page for rounding describes many, many functions with this line:

destType convert_destType<_sat><_roundingMode>(sourceType)

So for each sourceType-destType combination there is a set of functions: 4 rounding modes and an optional saturation. It’s easy in Ruby to define each of these functions, but it takes a lot more time in C.
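The C preprocessor can take away part of that repetition. A minimal sketch (the macro and its naming scheme are mine – C has no overloading, so the source type is encoded in the name):

// Generates one converter per type-combination, e.g. convert_int_rtz_from_float().
#define DEFINE_CONVERT_RTZ(destType, sourceType) \
   static inline destType convert_##destType##_rtz_from_##sourceType(sourceType x) { \
      return ((destType)(x)); \
   }

DEFINE_CONVERT_RTZ(int, float)
DEFINE_CONVERT_RTZ(long, double)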

The 4 rounding modes are:

  • _rte – round to nearest even
  • _rtz – round towards zero
  • _rtp – round toward positive infinity
  • _rtn – round toward negative infinity

The below pieces of code should also explain what the functions actually do.

Round to nearest even

This means that numbers get rounded to the closest integer, and ties round to even: both 3.5 and 4.5 round to the even number 4. Thanks to Dithermaster for pointing out my wrong assumption and clarifying how it should work.

inline int convert_int_rte (float number) {
   int sign = (number > 0) - (number < 0);
   int truncated = (int)number;              // rounds towards zero
   float frac = (number - truncated) * sign; // fractional part, now >= 0
   if (frac > 0.5f) return (truncated + sign);
   if (frac < 0.5f) return (truncated);
   // exactly halfway: round to the even neighbour
   return ((truncated % 2 == 0) ? truncated : truncated + sign);
}

I’m sure there is a more optimal implementation. You can fix that in Github (see below).

Round to zero

This means the fractional part is simply dropped: positive numbers are rounded down, negative numbers are rounded up. 1.6 becomes 1, and -1.6 becomes -1.

inline int convert_int_rtz (float number) {
   return ((int)(number));
}

Effectively, this just removes everything behind the point.

Round to positive infinity

1.4 becomes 2, -1.6 becomes -1.

inline int convert_int_rtp (float number) {
   return ((int)ceil(number));
}

Round to negative infinity

1.6 becomes 1, -1.4 becomes -2.

inline int convert_int_rtn (float number) {
   return ((int)floor(number));
}

Saturation

Saturation is another word for “avoiding overflow and NaN”. It makes sure that numbers stay between INT_MIN and INT_MAX (from limits.h), and that NaN returns 0. If not used, the outcome of the function can be anything (-2147483648 in case of convert_int_rtz(NAN) on my computer). Saturation is more expensive, so therefore it’s optional.

inline float saturate_int(float number) {
   if (isnan(number)) return 0.0f; // check if the number was already NaN
   // note: (float)INT_MAX rounds up to 2^31; a fully correct version clamps in the integer domain
   if (number >= (float)INT_MAX) return ((float)INT_MAX);
   return (number <= (float)INT_MIN ? (float)INT_MIN : number);
}

Effectively the other functions become like:

inline int convert_int_sat_rtz (float number) {
   return ((int)(saturate_int(number)));
}
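A quick test program shows the differences between the modes – a sketch, assuming the functions above are in scope:

#include <stdio.h>

int main(void) {
   float samples[] = { 2.5f, 3.5f, 1.6f, -1.6f, -1.4f };
   for (int i = 0; i < 5; i++) {
      float f = samples[i];
      printf("%5.1f rte=%2d rtz=%2d rtp=%2d rtn=%2d\n", f,
             convert_int_rte(f), convert_int_rtz(f),
             convert_int_rtp(f), convert_int_rtn(f));
   }
   return 0;
}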

Doubles, longs and getting started.

Yes, you need to make functions for all of these. But you could of course also check out the project on Github (BSD licence, rudimentary first implementation).

You’re free to make a double-version of it.

How many threads can run on a GPU?

Blocks of threads: 5×5 blocks of 3×3×3 threads

Q: Say a GPU has 1000 cores, how many threads can efficiently run on a GPU?

A: At a minimum, around 4 billion can be scheduled, while tens of thousands can run simultaneously.

If you are used to working with CPUs, you might have expected 1000, or 2000 with hyper-threading. Handling so many more threads than the number of available cores might sound inefficient, but there are a few reasons why a GPU has been designed to handle so many threads. Read further…
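To get a feel for the numbers, consider latency-hiding with a back-of-the-envelope sketch – all values below are made up for illustration only:

#include <stdio.h>

int main(void) {
   // hypothetical numbers, for illustration only
   int cores = 1000;       // physical cores on the GPU
   int mem_latency = 400;  // cycles to fetch from global memory
   int alu_work = 10;      // cycles of computation per fetch
   // each core needs enough threads in flight to hide the memory latency
   int threads_per_core = (mem_latency + alu_work) / alu_work;
   printf("threads running simultaneously: %d\n", cores * threads_per_core);
   return 0;
}

With these made-up numbers each core keeps 41 threads in flight, giving 41,000 threads running simultaneously – tens of thousands indeed.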

NOTE: The below description is a (very) simplified model with the purpose to explain the basics. It is far from complete, as it would take a full book-chapter to explain it all. Continue reading “How many threads can run on a GPU?”

ARM Mali-T604 GPU has 3.5x more performance than dual core Cortex-A15

According to the latest newsletter of the Mont-Blanc Project, the GPU on a Samsung Exynos 5 is much faster and greener than its CPU: 3.5 times faster at half the energy. They built a supercomputer using 810 Exynos SoCs that can deliver 26 TFLOPS of peak performance. With the upcoming mobile GPUs becoming exponentially faster, they have all the expertise to build an even faster and greener ARM-supercomputer after this.

The Mont-Blanc compute cards deliver considerably higher performance at 50% lower energy consumption, compared with previous ARM-based developer platforms.

The Mont-Blanc prototype is based on the Samsung Exynos 5 Dual SoC, which integrates a dual-core ARM Cortex-A15 and an on-chip ARM Mali-T604 GPU, and has been featured and market proven in advanced mobile devices. The dual-core ARM Cortex-A15 delivers twice the performance of the quad-core ARM Cortex-A9, used in the previous generation of ARM-based prototype, whilst consuming 20% less energy for the same workload. Furthermore, the on-chip ARM Mali-T604 GPU provides 3.5 times higher performance than the dual-core Cortex-A15, whilst consuming half the energy for the same workload.

Each Mont-Blanc compute card integrates one Samsung Exynos 5 Dual SoC, 4 GB of DDR3-1600 DRAM, a microSD slot for local storage and a 1 GbE NIC, all in an 85x56mm card (3.3×2.2 inches). A single Mont-Blanc blade integrates fifteen Mont-Blanc compute cards and a 1 GbE crossbar switch, which is connected to the rest of the system via two 10 GbE links. Nine Mont-Blanc blades fit into a standard BullX 9-blade INCA chassis. A complete Mont-Blanc rack hosts up to six such chassis, providing a total of 1620 ARM Cortex-A15 cores and 810 on-chip ARM Mali-T604 GPU accelerators, delivering 26 TFLOPS of peak performance.
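The numbers add up as follows:

810 compute cards = 15 cards/blade × 9 blades/chassis × 6 chassis
1620 CPU cores = 810 cards × 2 Cortex-A15 cores per Exynos 5 Dual
810 GPU accelerators = 810 cards × 1 Mali-T604 per Exynos 5 Dual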

“We are only scratching the surface of the Mont-Blanc potential”, says Alex Ramirez, coordinator of the Mont-Blanc project. “There is still room for improvement in our OpenCL algorithms, and for optimizations, such as executing on both the CPU and GPU simultaneously, or overlapping MPI communication with computation.”

Continue reading “ARM Mali-T604 GPU has 3.5x more performance than dual core Cortex-A15”

Keep The Hardware Focus

The real Apu

If you buy a car, the kind of fuel is not often the first choice. You first select on the engine-properties, the looks, the interior, the brand and for sure the total cost of ownership. The costs can be a reason to choose a certain type of fuel though. In the parallel-computation world it is different: there the fuel (CUDA or OpenCL) is the first decision, and only then is the hardware chosen. I think this is wrong, and that is why I speak a lot about CUDA-vs-OpenCL, while I think NVidia is a good choice for a whole list of algorithms.

If we give advice during a consult, we want to give the best advice. In case of CUDA, that would be based on budget: go for Tesla or the latest GTX; in case of OpenCL we can give much better advice on hardware. Actually, starting with the technique is the worst thing you can do: focus on the hardware and then pick the technique that suits it best.

IMPORTANT. The following is for understanding some concepts and limits only! It is purely theoretical, so I don’t claim any real-world results. Also not taken into account is how well different processors handle control-instructions (for, while, if, case, etc.), which has quite some influence on actual performance.

Continue reading “Keep The Hardware Focus”

OpenCL on Altera FPGAs

On 15 November 2011 Altera announced support for OpenCL. The time between announcing OpenCL-support and actually seeing working SDKs always takes longer than expected, so I did not expect anything working on FPGAs before 2013. Good news: the drivers are actually working (if you can trust the demos at presentations).

There have been three presentations lately:

In this article I share with you what you should not have missed on these sheets, and add some personal notes to it.

Is OpenCL the key that finally makes FPGAs not tomorrow’s but today’s technology?

Continue reading “OpenCL on Altera FPGAs”

NVIDIA: mobile phones, tablets and HPC (cloud)

If you want to see what is coming up in the market of consumer-technology (PC, mobile and tablet), then NVIDIA can tell you the most. The company is very flexible, and shows time after time that it really knows which markets it currently operates in and which it can enter. I sometimes strongly disagree with their marketing, but watch them closely, as they are in the most important markets for defining the near future: PCs, mobile/tablet and HPC.
You might think I completely miss interconnects (buses between processors, devices and memory) and memory-technologies, as clouds have a large need for high-speed data-transport, but the last 20 years have shown that this is a quite stably developing market, based on selling IP to the hardware-vendors. With the acquisition of Cray’s interconnect technology, we have seen this is serious business for Intel, so things might change indeed. For this article I want to focus on NVIDIA’s choices.

Improving FinanceBench

If you’re into computational finance, you might have heard of FinanceBench.

It’s a benchmark developed at the University of Delaware, aimed at those who work with financial code, to see how certain code paths can be targeted for accelerators. It utilizes the original QuantLib software framework and samples to port four existing applications for quantitative finance. It contains codes for the Black-Scholes, Monte-Carlo, Bonds and Repo financial applications, which can be run on the CPU and GPU.

The problem is that it has not been maintained for 5 years, and there were good improvement opportunities. Even though the paper has already been written, we think it can still be of good use within computational finance. As we were seeking a way to make a demo for the financial industry that is not behind an NDA, this looked like the perfect starting point. We emailed all the authors of the library, but unfortunately did not get any reply. As the code is provided under a permissive license, we could luckily go forward.

The first version of the code will be released on Github early next month. Below we discuss some design choices and preliminary results.

Continue reading “Improving FinanceBench”

LEAP-conference call for papers

Building bridges in a new industry

Embedded processors have always had their focus on low energy. Now a combination of Moore’s law, the frequency-wall and multi-processor developments has made it possible for these processors to compete in completely new market segments, most notably due to impressive advancements in graphics IP.
We are now looking at four groups who are interested in learning from each other:

  • The embedded processor market
  • The FPGA market
  • The HPC and server market
  • The GPGPU market

And answer the question: how can we get more out of low-energy processors by looking at other industries?

The goal of the LEAP conference is to bring these four groups together, creating windows to each other and paving roads over the newly constructed bridges. This makes it one of a kind. Half of the conference is focused on quality information sharing and the other half on networking. For more information, check the website of the LEAP-conference. StreamHPC is a co-organiser.

Call for papers is closed – the programme is filled!

Continue reading “LEAP-conference call for papers”

DHPCC++ Program known

During IWOCL a workshop takes place that discusses the opportunities C++ brings to OpenCL-enabled processors. A well-known example is SYCL, but various other approaches are discussed as well.

The Distributed & Heterogeneous Programming in C/C++ Workshop just released their program:

  • HiHAT: A New Way Forward for Hierarchical Heterogeneous Asynchronous Tasking.
  • SYCL C++17 and OpenCL interoperability experimentation with triSYCL.
  • KART – A Runtime Compilation Library for Improving HPC Application Performance.
  • Using SYCL as an Implementation Framework for HPX Compute.
  • Adding OpenCL to Eigen with SYCL.
  • SYCL-BLAS: Leveraging Expression Trees for Linear Algebra.
  • Supporting Distributed and Heterogeneous Programming Models in ISO C++.
  • Towards an Asynchronous Data Flow Model for SYCL 2.2.
  • Communicating Execution Contexts using Channels.
  • Towards a Unified Interface for Data Movement.
  • Panel: What do you want in C++ for Heterogeneous.

As C++ is an important direction for OpenCL, we expect most of the discussions on the programmability of OpenCL-enabled processors to take place here.

The detailed program is to be released soon. For more information see the DHPCC++ webpage. At the IWOCL website you can buy tickets and passes – combining with IWOCL gives a discount.