PDFs of Monday 12 September

As sharing my readings became more popular, I decided to put them on my site. I focus on everything that uses vector processing (GPUs, heterogeneous computing, CUDA, OpenCL, GPGPU, etc). Did I miss something, or do you have a story you want to share? Contact me or comment on this article. If you tell others about the projects you discovered here, I would appreciate a mention of my website or my Twitter handle @StreamHPC.

The research papers have their authors mentioned; the other links can be presentations or overviews of (mostly) products. I have read them all, except the long PhD theses (which are on my non-ad-hoc reading list) – drop me any question you have.

Bullet Physics, Autodesk style. AMD and Autodesk on integrating the Bullet Physics engine into Maya.

MERCUDA: Real-time GPU-based marine scene simulation. OpenCL has enabled more realistic sea and sky simulation for this product, see page 7.

J.P.Morgan: Using Graphic Processing Units (GPUs) in Pricing and Risk. Two pages describing how OpenCL/CUDA can give a 10 to 100 times speedup over conventional methods.

Parallelization of the Generalized Hough Transform on GPU (Juan Gómez-Luna, José María González-Linares, José Ignacio Benavides, Emilio L. Zapata and Nicolás Guil). Describes two parallel methods for the Fast Generalized Hough Transform (Fast GHT) using GPUs, implemented in CUDA. It studies how load balancing and occupancy impact the performance of an application on a GPU. An interesting article, as it shows that you can choose which limits you bump into.

Performance Characterization and Optimization of Atomic Operations on AMD GPUs (Marwa Elteir, Heshan Lin and Wu-chun Feng). A measurement of the impact of using atomic operations on AMD GPUs. It seems that even mentioning ‘atomic’ puts the kernel in atomic mode, which has a major influence on performance. They also come up with a solution: software-based atomic operations. Work in progress.

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing (Mayank Daga, Ashwin M. Aji, and Wu-chun Feng). Another one from Virginia Tech, this time on AMD’s APUs. This article measures their performance via a set of micro-benchmarks (e.g., PCIe data transfer), kernel benchmarks (e.g., reduction) and actual applications (e.g., molecular dynamics). Very interesting to see in which cases discrete GPUs are at a disadvantage even with more muscle power.

A New Approach to rCUDA (José Duato, Antonio J. Peña, Federico Silla, Juan C. Fernández, Rafael Mayo, and Enrique S. Quintana-Ortí). On (remote) execution of CUDA software within VMs. Interesting if you want powerful machines in your company to delegate heavy work to, or if you are interested in clouds.

Parallel Smoothers for Matrix-based Multigrid Methods on Unstructured Meshes Using Multicore CPUs and GPUs (Vincent Heuveline, Dimitar Lukarski, Nico Trost and Jan-Philipp Weiss). Different methods around 8 multi-colored Gauß-Seidel type smoothers using OpenMP and GPUs. Also some words on scalability!

Visualization assisted by parallel processing (B. Lange, H. Rey, X. Vasques, W. Puech and N. Rodriguez). How to use GPGPU for visualising big data. An important benefit of real-time data-processing is that people get more insight into the matter. As an example they use temperatures in a server room. As I see more often now, they benchmark CPU, GPU and hybrids.

A New Tool for Classification of Satellite Images Available from Google Maps: Efficient Implementation in Graphics Processing Units (Sergio Bernabé and Antonio Plaza). A 30 times speed-up with a new parallel implementation of the k-means unsupervised clustering algorithm in CUDA. It is used for classification of satellite images.

TAU Performance System. Product presentation of TAU, which does, among other things, parallel profiling and tracing, with support for CUDA and OpenCL. An extensive collection of tools, so worth spending time on. A paper released in March describes TAU and compares it with two other performance measurement systems: PAPI and VampirTrace.

An Experimental Approach to Performance Measurement of Heterogeneous Parallel Applications using CUDA (Allen D. Malony, Scott Biersdorff, Wyatt Spear and Shangkar Mayanglambam). Using the TAU-based tool TAUcuda (see above), this paper describes what to focus on when optimising heterogeneous systems.

Speeding up the MATLAB complex networks package using graphic processors (Zhang Bai-Da, Wu Jun-Jie, Tang Yu-Hua and Li Xin). Free registration required. Their conclusion: “In a word, the combination of GPU hardware and MATLAB software with Jacket Toolbox enables high-performance solutions in normal server”. Another PDF I found was: Parallel High Performance Computing with emphasis on Jacket based computing.

Profile-driven Parallelisation of Sequential Programs (Georgios Tournavitis). PhD-thesis on a new approach for extracting and exploiting multiple forms of coarse-grain parallelism from sequential applications written in C.

OpenCL, Heterogeneous Computing, and the CPU. Presentation by Tim Mattson of Intel on how to use OpenCL with the vector-extensions of Intel-processors.

MMU Simulation in Hardware Simulator Based-on State Transition Models (Zhang Xiuping, Yang Guowu and Zheng Desheng). It seems a bit off-chart to have a paper on the Memory Management Unit of an ARM processor, but as the ARM processor gets more important, some insight into its memory system is valuable.

Multi-Cluster Performance Impact on the Multiple-Job Co-Allocation Scheduling (Héctor Blanco, Eloi Gabaldón, Fernando Guirado and Josep Lluís Lérida). This research group has developed a scheduling technique, and in this paper they discuss in which situations it works better than existing techniques.

Convey Computers: Putting Personality Into High Performance Computing. Product presentation. They combine X86 CPUs with pre-programmed FPGAs to get high throughput. In short: if you make heavy use of the provided algorithms, this might be an alternative to GPGPU.

High-Performance and High-Throughput Computing. What it means for you and your research. Presentation by Philip Chan of Monash University. Though the target group is their own university, it gives nice insight into how things are done at other universities and research groups. HPC is getting cheaper and is accepted in more and more types of research.

Bull: Porting seismic software to the GPU. Presentation for oil companies on finding new oil fields. These seismic calculations are quite computation-intensive and therefore portable HPC is needed. Note that StreamHPC is also assisting in porting such code to GPUs.

Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems (Shuai Che, Jeremy W. Sheaffer and Kevin Skadron). This piece of software allows CUDA-programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms.

Real-time volumetric shadows for dynamic rendering (MSc thesis of Alexandru Teodor V.L. Voicu). Self-shadowing using the Opacity Shadow Maps algorithm is not fit for real-time processing. This thesis discusses Bounding Opacity Maps, a novel method to overcome this problem. It includes code at the end, which you can download here.

Accelerating Foreign-Key Joins using Asymmetric Memory Channels (Holger Pirk, Stefan Manegold and Martin Kersten). Shows how to accelerate foreign-key joins by executing the random table lookups in the GPU’s VRAM while sequentially streaming the foreign-key index over the PCIe bus. Very interesting as an example of making clever use of I/O bounds.

Come back next Monday for more interesting research papers and product presentations. If you have questions, don’t hesitate to contact StreamHPC.

Khronos OpenCL presentation at SIGGRAPH 2010

Here you can find the videos uploaded by Khronos of their presentation about OpenCL. I added the timeline, so you can easily scroll to the more interesting parts. The presentations by Ofer Rosenberg of Intel and Cliff Woolly of NVIDIA have not been uploaded (yet). Please note that for non-American listeners the speech of Affi Munchie is hard to follow; luckily his sheets explain most of it.

http://www.youtube.com/watch?v=BdZFtcQ2LYw

For the first two presentations the sheets can be downloaded from the Khronos website. The timeline mentions the sheet numbers.

0:00 [sheet 1] Presentation by the president of Khronos and chair of the session: Neil Trevett of NVIDIA.
0:06 [sheet 2] Welcome and a quick overview.
1:12 [sheet 3] The prizes for the attendees (not for us online viewers).
1:40 [4] Overview of all members of Khronos. Khronos does not only take care of OpenCL, but also of the more famous OpenGL and of projects like COLLADA.
2:26 [5] Processor parallelism. CPUs are getting more parallel and GPUs more programmable. The overlapping area is called heterogeneous computing, and that is where OpenCL pops up.
3:10 [6] OpenCL timeline from version 1.0 to 1.1.
4:44 [7] The OpenCL working group with only 30 logos. He mentions missing logos, like the one from Apple.
5:18 [8] The Visual Computing Ecosystem, where OpenCL’s interoperability with other standards is shown. The talk is not complete, so I don’t know if he talks about DirectX.

Continue reading “Khronos OpenCL presentation at SIGGRAPH 2010”

GPU Cloud

Put your software in the cloud and have your customers connect to your service at 100 MBit.

We offer two types of contracts:

  • Pay per day. Get dedicated GPUs for your OpenCL software.
  • Pay per call. When you need to be present to demo your solution, and your GPU-calls are short.

The servers run Ubuntu by default, and you get a chrooted environment for maximum performance. In most cases the software doesn’t need to be altered.

We offer full support, from software tuning to marketing your product.

OpenCL Books

[infobox type=”information”]

Want a book written by us?

Format: PDF
Digital: 150-200 pages
Price: TBD
Author: The StreamHPC team
OpenCL-version: 2.0

[/infobox]

Below are the books that are available as a downloadable PDF or as a printed book. Please contact us if a printed book is missing. Note that I tend to be critical of books, not overwhelmingly positive or talking in superlatives. Most books target C and C++ developers, so be sure you learn the basics of C or C++ before learning OpenCL. Most important are bit-shifts, pointers and structs, but you also need to think more in hardware than in getting-things-done to get the full potential out of OpenCL. While you get that book on C, also get a book on computer architecture to understand the concept of bandwidth to its fullest. Then you are ready for one of the pearls below. Happy reading!
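If you want a quick self-test of those C basics before picking a book, here is a tiny sketch (plain C, nothing OpenCL-specific, names made up for the example) that combines all three: a struct, a pointer to it, and bit-shifts to pack and unpack data the way kernels often do with pixels.

#include <stdio.h>
#include <stdint.h>

/* One pixel, and the same pixel packed into a single 32-bit word. */
typedef struct {
    uint8_t r, g, b, a;
} rgba_t;

/* Pack the four 8-bit channels into a uint32_t with shifts and ORs. */
static uint32_t pack_rgba(const rgba_t *p)
{
    return ((uint32_t)p->a << 24) | ((uint32_t)p->b << 16) |
           ((uint32_t)p->g << 8)  |  (uint32_t)p->r;
}

int main(void)
{
    rgba_t pixel = { 255, 128, 0, 255 };
    uint32_t packed = pack_rgba(&pixel);     /* pointer to a struct */
    unsigned green = (packed >> 8) & 0xFF;   /* unpack with shift + mask */
    printf("packed = 0x%08X, green = %u\n", (unsigned)packed, green);
    return 0;
}

If every line of this is obvious to you, you are ready; if not, that C book comes first.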

The books are ordered descending by published date, except the first.

“The OpenCL specifications” by the Khronos Group

2.0 – current version

Format: PDF
File Size: 2.2MB
Digital: 281 pages
Price: Free
Publisher: Khronos Group
Author: Aaftab Munshi (Editor)
Published Date: 13 July 2013 (version 2.0, revision 11)
OpenCL-version: 2.0
Homepage: http://www.khronos.org/registry/cl/ (direct PDF: http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf)

As a specifications document, you cannot expect a nice piece of prose, but most of the knowledge you need is in it. There are certainly some gaps (especially in clear explanations), but every version is getting better. When studying other sources, always have this document with you as a reference. I printed it two pages per side (A4).

Read chapters 1 to 3, and leave the rest as a reference. Other books explain the long, long lists of language specifications in a nicer form. Also, most of it you’ll learn better by doing.

1.2, 1.1, 1.0 – previous versions

Format: PDF
File Size: 3.3MB
Digital: 380 pages
Price: Free
Publisher: Khronos Group
Author: Aaftab Munshi (Editor)
Published Date: 14 November 2012 (version 1.2, revision 19)
OpenCL-version: 1.2
Homepage: http://www.khronos.org/registry/cl/

As much software is still written against a previous version, it is good to know the differences. The differences between 1.1 and 1.2 are described here.


OpenCL Programming By Example


Format: eBook/pBook
Pages: 304
Price: €4,50 (e), €50,- (p+e)
Publisher: PACKT
Authors: Ravishekhar Banger (AMD), Koushik Bhattacharyya (AMD)
Published Date: December 2013
OpenCL-version: 1.2
Homepage: http://www.packtpub.com/opencl-programming-by-example/book

“The OpenCL Programming Book” – Fixstars

Two versions are available: the 1.0 version and the 1.2 version. To start with the 1.2:

Format: eBook
Pages: 325
Price: USD 19.50
Publisher: Fixstars Corporation
Authors: Ryoji Tsuchiyama, Takashi Nakamura, Takuro Iizuka, Aki Asahara, Satoshi Miki and Jeongdo Son. Satoru Tagawa (translator)
Published Date: January 2012
OpenCL-version: 1.2
Homepage: http://www.fixstars.com/en/opencl/book/


Format: eBook
Pages: 246
Price: free
Publisher: Fixstars Corporation
Authors: Ryoji Tsuchiyama, Takashi Nakamura, Takuro Iizuka, Akihiro Asahara, Satoshi Miki
Published Date: 31 March 2010
OpenCL-version: 1.0
Homepage: http://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/contents/

1.0-version: It seems to be translated from Japanese to English, but except for some small typos and spelling errors the book is very easy to read. The book explains the chapters you could skip in Khronos’ specifications document, but it certainly is not complete, since it discusses OpenCL 1.0 and focuses on the basics. The parts that build up a program step by step are a bit annoying to read, because they repeat the whole program while only a few lines have changed. The book would be more like 180-200 pages if written more compactly.

1.2-version: Thicker, more up to date and with the promise of fewer translation errors.

Heterogeneous Computing with OpenCL, second edition

Format: pBook
Pages: 400 (approx.)
Price: USD 69.95
Publisher: Morgan Kaufmann
Authors: Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry & Dana Schaa
Published Date: November 2012
OpenCL-version: 1.2
Homepage: http://www.elsevierdirect.com/product.jsp?isbn=9780123877666

This is what we all chose OpenCL for: hybrid processors. And this book dives into that world completely, so we actually learn a lot of new things about the advantages of having a GPU on your lap.

The new edition upgrades the book to version 1.2, but not much new has been added. So if you have the first edition, there is no need to buy the second.

OpenCL in Action

Format: eBook + pBook
Pages: 475
Price: USD 47.99 (e), USD 59.99 (p+e)
Publisher: Manning Publications
Authors: Matthew Scarpino
Published Date: non-final version updated regularly, target November 2011
OpenCL-version: 1.1
Homepage: http://www.manning.com/scarpino2/

Matthew Scarpino also wrote SWT/JFace in Action and Programming the Cell Processor, is a Linux professional and has much experience in IT. The book targets an audience that wants a more practical guide to learning OpenCL. He runs a blog at http://www.openclblog.com/.

It is currently my favourite book and a must-have for everybody interested in or working with OpenCL.

OpenCL Programming Guide

Format: PDF and/or print
Pages: 648
Price: USD 35.19 (e), USD 43.99 (p), USD 59.39 (p+e)
Publisher: Addison-Wesley Professional
Authors: Aaftab Munshi (Apple, Khronos Group), Benedict Gaster (AMD), Timothy G. Mattson, Dan Ginsburg
Published Date: August 2011
OpenCL-version: 1.1
Homepage: http://my.safaribooksonline.com/9780132488006 and http://www.openclprogrammingguide.com/

Aaftab Munshi is also responsible for the OpenCL specifications, so he probably knows what he’s talking about.

At 648 pages it is quite a bit bigger than the targeted 480. Currently this is a very good replacement for Fixstars’ book. A disadvantage is that shipping the printed book overseas (outside USA/Canada) is much too expensive, so people from the Eurasian continent, Africa and Latin America should just print it locally – I’m looking into better options for that.

OpenCL Parallel Programming Development Cookbook


Format: pBook + eBook
Pages: 303
Price: USD 26.39 (e) / 54.99 (p+e)
Publisher: Packt Publishing
Author: Raymond Tay
Published Date: August 2013
OpenCL-version: 1.2
Homepage: http://www.packtpub.com/opencl-parallel-programming-development-cookbook/book

An introductory book for OpenCL beginners. Examples include: histogram, Sobel edge detection, matrix multiplication, sparse matrix-vector multiplication, bitonic sort, radix sort and n-body.

Programming Massively Parallel Processors

Format: pBook
Pages: 258 pages
Price: USD 46.40
Publisher: Morgan Kaufmann
Authors: David B. Kirk (NVIDIA) and Wen-mei W. Hwu (University of Illinois)
Published Date: 28 January 2010
OpenCL-version: 1.1?
Homepage: http://blogs.nvidia.com/ntersect/2010/01/worlds-first-textbook-on-programming-massively-parallel-processors.html

The book claims to discuss both OpenCL and CUDA, but actually there is just one chapter on OpenCL and the focus is strongly towards NVIDIA hardware. It is a nice book for people who need to learn to program CUDA-only software/hardware and don’t want a book that’s too hard to understand. There are assignments at the end of each chapter, and important subjects are explained in detail, so you won’t have a hard time with those assignments.

It is not a good fit for people interested in the OpenCL-compliant architectures from AMD, ARM and IBM besides NVIDIA’s, but it is one of the best resources for understanding NVIDIA architectures from the viewpoint of a GPGPU programmer. The second edition adds more chapters on, for example, MPI and OpenACC. It is also less negative about OpenCL than the first edition.

Install OpenCL on Debian, Ubuntu and Mint orderly

Libraries – can’t have enough

If you read the various manuals on how to compile OpenCL software on Linux, you can get dizzy from all the LD parameters. Also, when installing the SDKs from AMD, Intel and NVIDIA, you get different locations for libraries, header files, etc. Now that GPGPU is old-fashioned and we go for heterogeneous programming, chances are that you will have more than one SDK on your machine. Even if you want to keep your setup the way it is, this article gives you insight into the design behind it all. Note that Intel’s drivers provide OpenCL support only for their CPUs, not for their GPUs.

As my mother said when I was young: “actually cleaning up is very simple”. I’m busy creating a PPA for this, but that will take some more time.

First the idea. For developers OpenCL consists of 5 parts:

  • GPUs-only: drivers with OpenCL-support
  • The OpenCL header-files
  • Vendor specific libraries (needed when using -lOpenCL)
  • libOpenCL.so -> a special driver
  • An installable client driver

Currently GPU-drivers are always OpenCL-capable, so you only need to secure 4 steps. These are discussed below.

Please note that certain 64-bit distributions do not have lib64, but only ‘lib’ and ‘lib32’. If that is the case for you, use the commands mentioned for 32-bit.
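Once the four parts are in place, a quick sanity check is a small host program that only asks the ICD loader which platforms it can find. A minimal sketch (compile with something like gcc list_platforms.c -lOpenCL, adding the -I path of whichever SDK provides your headers):

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_uint num_platforms = 0;
    /* The ICD loader (libOpenCL.so) forwards this call to every vendor
       driver registered in /etc/OpenCL/vendors/. */
    cl_int err = clGetPlatformIDs(0, NULL, &num_platforms);
    if (err != CL_SUCCESS || num_platforms == 0) {
        printf("No OpenCL platforms found (error %d)\n", err);
        return 1;
    }
    if (num_platforms > 16) num_platforms = 16;

    cl_platform_id platforms[16];
    clGetPlatformIDs(num_platforms, platforms, NULL);

    for (cl_uint i = 0; i < num_platforms; ++i) {
        char name[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);
        printf("Platform %u: %s\n", i, name);
    }
    return 0;
}

If every SDK you installed shows up in the list, the loader, the vendor library and the ICD file are all wired up correctly.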

Continue reading “Install OpenCL on Debian, Ubuntu and Mint orderly”

Scientific Visualisation of Molecules

In many hard sciences the focus is on formulas and text, whereas images are mainly graphs or simplified representations of the researched matter. Beautiful visualisations are mostly artist’s impressions in popular media targeting hobby scientists. When Cyrille Favreau made the first well-working version of his real-time GPU-accelerated ray tracer, he saw potential in exactly this area: beautiful, realistic visualisations to be used in serious science. This resulted in software called IPV.

He chose to focus on rendering protein molecules. This article discusses ray tracing in the molecular sciences, while highlighting the features of the software.

This project has been discussed on GPU Science, but this article looks at the software from a slightly different perspective. If you don’t want to know how the software works and what it can do, scroll down for a download link.

Continue reading “Scientific Visualisation of Molecules”

OpenCL alternatives for CUDA Linear Algebra Libraries

While CUDA has had the advantage of having many more libraries, this is no longer its main advantage when it comes to linear algebra. If one thing has changed over the past year, it is the linear algebra library support for OpenCL. The choices have increased at a continuous rate, as you can see in the list below.

A general remark when using these libraries: you need to handle your data transfers and data formats with great care. If you don’t think it through, you won’t get the promised speed-up. If not mentioned otherwise, a library is free.
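To make that remark a bit more concrete, below is a minimal host-side sketch (plain OpenCL C API, error handling stripped, the helper name run_once is made up) of the round trip that every one of these libraries has to make under the hood. If the compute part is small, the two PCIe transfers easily dominate – so batch your work and keep the data on the device between library calls instead of shipping it back and forth for every operation.

#include <CL/cl.h>
#include <stddef.h>

/* Hypothetical helper: push data to the first GPU, run something on it,
   and fetch the result. Error handling stripped to keep the transfer
   pattern visible. */
void run_once(const float *host_in, float *host_out, size_t n)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                n * sizeof(float), NULL, NULL);

    /* Transfer 1: host -> device over PCIe. */
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, n * sizeof(float),
                         host_in, 0, NULL, NULL);

    /* ... enqueue the library call or kernel that works on 'buf' ... */

    /* Transfer 2: device -> host over PCIe. */
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float),
                        host_out, 0, NULL, NULL);

    clReleaseMemObject(buf);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
}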

Below, for each subject the CUDA library is described first, followed by the OpenCL alternative(s).

FFT

CUDA: The NVIDIA CUDA Fast Fourier Transform library (cuFFT) provides a simple interface for computing FFTs up to 10x faster. By using hundreds of processor cores inside NVIDIA GPUs, cuFFT delivers the…

OpenCL: clFFT is a software library containing FFT functions written in OpenCL. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and multicore programming.

Linear Algebra

CUDA: MAGMA is a collection of next-generation, GPU-accelerated linear algebra libraries, designed for heterogeneous GPU-based architectures. It supports interfaces to the current LAPACK and BLAS standards.

OpenCL: clMAGMA is an OpenCL port of MAGMA for AMD GPUs. The clMAGMA library dependencies, in particular an optimized GPU OpenCL BLAS and CPU-optimized BLAS and LAPACK for AMD hardware, can be found in the AMD Accelerated Parallel Processing Math Libraries (APPML).

Sparse Linear Algebra

CUDA: CUSP is an open-source C++ library of generic parallel algorithms for sparse linear algebra and graph computations on CUDA-architecture GPUs. CUSP provides a flexible, high-level interface for manipulating sparse matrices and solving sparse linear systems.

OpenCL: clBLAS implements the complete set of BLAS level 1, 2 & 3 routines. Please see Netlib BLAS for the list of supported routines. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and multicore programming. ViennaCL is a free open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs. The library is written in C++ and supports CUDA, OpenCL and OpenMP. In addition to core functionality and many other features including BLAS level 1-3 support and iterative solvers, the latest release ViennaCL 1.5.0 provides many new convenience functions and support for integer vectors and matrices. VexCL is a vector expression template library for OpenCL/CUDA. It has been created for ease of GPGPU development with C++. VexCL strives to reduce the amount of boilerplate code needed to develop GPGPU applications. The library provides convenient and intuitive notation for vector arithmetic, reduction, sparse matrix-vector products, etc. Multi-device and even multi-platform computations are supported.

Random Number Generation

CUDA: The NVIDIA CUDA Random Number Generation library (cuRAND) delivers high-performance GPU-accelerated random number generation (RNG). The cuRAND library delivers high-quality random numbers 8x…

OpenCL: The Random123 library is a collection of counter-based random number generators (CBRNGs) for CPUs (C and C++) and GPUs (CUDA and OpenCL). They are intended for use in statistical applications and Monte Carlo simulation and have passed all of the rigorous SmallCrush, Crush and BigCrush tests in the extensive TestU01 suite of statistical tests for random number generators. They are not suitable for use in cryptography or security, even though they are constructed using principles drawn from cryptography.

Math Library

CUDA: The CUDA Math library is an industry-proven, highly accurate collection of standard mathematical functions, available to any CUDA C or CUDA C++ application simply by adding “#include <math.h>” in…

OpenCL: Looking into the details of what the CUDA math lib exactly is.

AI

CUDA: A technology preview with CUDA-accelerated game tree search of both the pruning and backtracking styles. Games available: 3D Tic-Tac-Toe, Connect-4, Reversi, Sudoku and Go.

OpenCL: There are many tactics to speed up such algorithms. This CUDA library can therefore only be used for limited cases, but nevertheless it is a very interesting research area. Ask us for an OpenCL-based backtracking and pruning tree search, tailored to your problem.

Dense Linear Algebra

CUDA: CULA provides accelerated implementations of the LAPACK and BLAS libraries for dense linear algebra. It contains routines for systems solvers, singular value decompositions and eigenproblems, and also provides various solvers. Free (with limitations) and commercial.

OpenCL: See ViennaCL, VexCL and clBLAS above. Kudos to the CULA-team, as they were one of the first with a full GPU-accelerated linear algebra product.

Fortran

CUDA: The IMSL Fortran Numerical Library is a comprehensive set of mathematical and statistical functions that offloads CPU work to NVIDIA GPU hardware, where the cuBLAS library is utilized. Free (with limitations) and commercial.

OpenCL: OpenCL-FORTRAN is not available yet. Contact us if you have interest and wish to work with a pre-release once available.

Multi-purpose (ArrayFire)

CUDA: Comprehensive GPU function library, including functions for math, signal processing, image processing, statistics and more. Interfaces for C, C++, Fortran and Python. Integrates with any CUDA program. Free (with limitations) and commercial.

OpenCL: ArrayFire 2.0 is also available for OpenCL. Note that currently fewer functions are supported in the OpenCL version than in CUDA-ArrayFire, so please check the OpenCL documentation for the supported feature list. Free (with limitations) and commercial.

Image & Signal Processing (NPP)

CUDA: The NVIDIA Performance Primitives library (NPP) is a collection of over 1900 image processing primitives and nearly 600 signal processing primitives that deliver 5x to 10x faster performance than…

OpenCL: Kudos to NVIDIA for bringing it all together in one place; OpenCL devs have to do some googling for specific algorithms.

So the gap between CUDA and OpenCL is certainly closing. CUDA provides a lot more convenience, so OpenCL-devs still have to keep reading blogs like this one to find what’s out there.

As usual, if you have additions to this list (free or commercial), please let me know in the comments below or by mail. I also have a few more additions to this list myself – depending on your feedback, I might present the data differently.

How to introduce HPC in your enterprise

Spare time in IT – © jaymz.eu

For the past ten years we have been happy when we got back home from the office. Our home computer is simply faster, has more software and more memory, and does not take over 10 minutes to boot. Office computers can be that slow, because 90% of the work is typing documents anyway. Meanwhile the office servers are mostly used for the intranet and backups only. It’s the way of life and it seems we have to accept it.

But what if you have a daily batch that takes an hour to run and 10 people need to wait for the results to continue their tasks? What if you simply need a bigger server to serve your colleagues faster? Then Office-HPC can be the answer: the type of High Performance Computing that is affordable and within reach for most companies with more than 50 employees.

Below you’ll find out what you should do, in a nutshell.

Phase 0: Get familiar with parallel and GPU-computing, and convince your boss

This will take one or two weeks only, as it’s more about understanding the basics.

Understand what it’s all about and what’s important. We offer trainings, but you can also look around in the “knowledge base” in the menu above for lots of free advice. This is very important and should be done before anything else. Even if you end up with CUDA, learn the basics of OpenCL first. Why? Because after CUDA there is only one answer: using NVIDIA hardware. Please delay that decision until later, so you don’t end up with the wrong solution.

How do you get your boss to invest in all this? I won’t lie about it: it’s a big investment. Luckily the return on investment is very good, even when only 10 people in the company use the software. If the waiting period per person is reduced by 20 minutes per day, it’s easy to see that it pays back quickly: 20 minutes over roughly 240 working days is 80 hours per person per year. Based on 10 people and a modest internal hourly rate, that is already €20K per year. StreamHPC has sped up software to take hours less time to process the daily data – therefore many of our clients could easily earn back the investment within a year.

Phase 1: Know what device you want to use

Quite often I get customers who have bought an expensive Tesla, FirePro or Xeon Phi and then ask me to speed up their software. Often I get questions like “how do I speed up this algorithm on this device?”, while the question should be “how do I speed up this algorithm?”. It takes some time to find out which device fits the algorithm best.

There is too much to discuss in this phase, so I’ll keep it to a short Q&A. Please ask us for advice, as this phase is very important! We prefer helping people for free over reading about failed “HPC in the office” projects (which give others the idea that the technology is not ready yet).

Q: What programming language do I use?

Let’s start with the short answer. Is everything to be used within your office only, forever? Then use any language you want: CUDA, OpenCL or one of the many others. If you want the software to run on more devices, use OpenCL or OpenGL shaders. For example, when developing with several partners you cannot stick to CUDA and should use OpenCL – otherwise you force others to make certain investments. But if you have some domain-specific compute engine where you will only share the API in the cloud, you can use CUDA without problems.

Part of the long answer is that it is entangled with the algorithm you want to use. Please take good care of this, and make your decision based on good research – not based on what people have told you without discussing your code first.

Q: FPGAs? Why would I use those?

True, they’re more expensive, but they use much less power (20-30 Watt TDP) and they’re famous for low-latency computations. If you already have OpenCL software, it ports quite easily to the FPGA – therefore I like the combination of AMD FirePro (good OpenCL support) and Altera Stratix V.

Xilinx recently also started to support OpenCL on their devices. They have the same reason as Altera: to make development time for FPGA code shorter.

Q: Why do CPUs still exist?

Because they perform pretty well on very irregular algorithms. The latest Xeon CPUs with 16 cores outperform GPUs when code-branch prediction is used heavily. And by using OpenCL you can get more performance than with OpenMP, plus you can port between devices much more easily.

Q: I heard I should not use gaming GPUs. Why not?

A: Professional accelerators come with support and tuned libraries, which explains part of the higher price. So even if gaming GPUs suffice, you need the support before you get to a cluster – the free support is mostly community-based and only answers the problems everybody has. Also, libraries are often better tuned for professional cards. See it like this: gaming GPUs come with free games, professional compute GPUs come with free support and libraries.

Q: I can’t have passively cooled server-GPUs in my desktop. What now?

  • Intel: Go for the Xeon Phis which end with an “A” (= actively cooled).
  • NVIDIA: For the newly announced K80 there will not be an actively cooled version – so take the actively cooled K40.
  • AMD: For the S9150, get a W9100.
  • Altera: Low-power, so you can use the same device. Do ask your supplier specifically whether that applies to the FPGA you have in mind.

Phase 2: Have your office computer upgraded

As the goal is to see performance in a cluster, it’s better to have at least two accelerators in your computer. This is a big investment, but it’s also a good investment: it’s the first step towards getting HPC into your office, so better do it well. If you want to use all of the GPUs’ memory, make sure your system has at least as much host memory as your accelerators have combined: the S9150 has 16GB of memory, so you need 32GB to support two cards.

If you make use of an external software development company, you also need to have a good machine to test out the software and to understand the code that will be rolled out in your company. Control and understanding of the code is very important when working with consultants!

In case you did not get through phase 1 completely, it is better to test with one accelerator first. Unless you need something like OpenGL/OpenCL interaction, make sure you use a third GPU for the video output, as display usage can influence GPU performance.

Program your software using MPI to connect the two accelerators, and be in full control of what is blocking, so you are prepared for the cluster.
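A minimal sketch of that idea, assuming one MPI rank per accelerator (the actual kernel work is left out): each rank picks “its own” GPU by index, so the same binary that drives the two cards in your desktop later drives the cluster with just a different mpirun line.

#include <mpi.h>
#include <CL/cl.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Enumerate the GPUs of the first platform and take the one that
       matches this rank: rank 0 -> GPU 0, rank 1 -> GPU 1, and so on. */
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_uint num_gpus = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &num_gpus);
    if (num_gpus == 0) { MPI_Finalize(); return 1; }
    if (num_gpus > 8) num_gpus = 8;

    cl_device_id gpus[8];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, num_gpus, gpus, NULL);
    cl_device_id mine = gpus[rank % num_gpus];

    char name[256];
    clGetDeviceInfo(mine, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("Rank %d uses %s\n", rank, name);

    /* ... create a context and queue on 'mine', run the kernels, and use
       blocking MPI_Send/MPI_Recv only where you really want to wait ... */

    MPI_Finalize();
    return 0;
}

Run it with mpirun -np 2 on the phase-2 workstation; the same command with a hostfile is what will drive the phase-4 cluster.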

Phase 3: Roll software out in a small group

At this phase it’s time to offer the service to a selected group. Say you have chosen to offer your compute solution via an Excel plugin, which communicates with the software via an API. Add new users one at a time, and make sure (parts of) the results are tested! From here on it’s software development as we know it, and the most unexpected bugs come out of the test group.

If you get good results, your colleagues will have some accelerators by now too. If you did phases 0 and 1 well, you will probably get good results anyway. The moment you have set up the MPI environment on multiple desktops, you have just set up your minimal test street. This is very important for later, as many enterprises lack a test street – in that case it’s better to have it partially shared with your development environment. I’m pretty sure I’ll get comments on this, but I would really like more companies to do larger-scale tests before the production step.

Phase 4: Get a cluster (or cloud service)

If your algorithm is not CPU-bound, then it’s best to have as many GPUs per CPU as possible; otherwise you need to keep it to one or two. We can already give you advice on this in phase 1, so you know what to prepare for. Then the most important step comes: calculate how much hardware you need to support the needs of your enterprise. It is possible that you only need one node with 8 GPUs to support even thousands of users.

If the algorithm is not CPU-bound, it’s best to put as many GPUs per node as possible. Personally I like ASUS servers most, as they are very open to all accelerators, unlike others who only offer accelerators from “selected partners”. At SC14 they introduced the ESC8000 E3, which holds 8 accelerators via PCIe3 x16 buses. There are more options available, but those only offer systems that don’t mention support for all vendors – my experience is that you get worse support if you do something special.

For Altera-only nodes you should look at completely different server cases, as the cooling requirements are different. For Xeon-only nodes you can find solutions with 4 CPU sockets.

If you are allowed to transport company-data outside the local network and can handle the data-transports over the internet, then a cloud-based service might also be a choice. Feel free to ask us what the options are nowadays.

You’re done

If the users are happy, then probably more software needs to be ported to the accelerators now. So good luck and have fun!

OpenCL – the battle, part III

The first two parts described hardware companies and operating systems, programming languages and software companies, and were written about half a year ago. Now we focus on what has driven NVIDIA and ATI/AMD for decades: games.

Disclaimer: this is an opinion piece on the current market. We are strong supporters of OpenCL and of all companies that support it. Since our advice on specific hardware in a consult is based on the customer’s specific demands, we could advise differently from what you would expect based on the article below.

Games

Computer games are cool, if only because you can choose from so many different kinds. While Tetris will live forever, the latest games also have something to add: realistic physics simulation. And that’s what GPUs do now. Nintendo has shown us that gameplay and good interaction are far more important than video quality. The wow-factor of photo-realistic real-time rendering is not what it was years ago.
You might know the basics for falling objects: F = m*g (force = mass times gravitational acceleration), and action = – reaction. If you drop some boxes, as a human being you can predict falling speed, interaction, rotation and possible changes of the centre of gravity from a still image. A computer has to do a lot more to detect collisions, but the idea is very doable on a fast CPU. A well-known open-source library for these purposes is Bullet Physics. The nice thing comes when there are not just a few boxes, but thousands of them. Or when you walk through water or under a waterfall, see fire and smoke, break wood but bend metal, etc. The accelerometer of the iPod was a game-changer too in the demand for more realism in graphics. For an example of a “physics puzzle game” not using GPGPU see World of Goo (with free demo) – for the rest we talk more about high-end games. Of the current game-ready systems, PCs (Apple, Linux and Windows) have OpenCL support, the Sony PlayStation 3 is now somewhat vague, and the Xbox 360 has none.
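To make the “thousands of boxes” point concrete, here is a toy sketch in plain C (no collision detection, no Bullet) of the per-object work: F = m*g gives every box the same acceleration g, which you integrate each frame. Each box is independent of the others, which is exactly why this kind of loop maps so well onto a GPU with one work-item per box.

#include <stddef.h>

typedef struct {
    float x, y, z;     /* position */
    float vx, vy, vz;  /* velocity */
    float mass;        /* unused below: a = F/m = g for every box */
} box_t;

/* One explicit-Euler step for all boxes. */
void step_boxes(box_t *boxes, size_t n, float dt)
{
    const float g = -9.81f;            /* gravity along y, in m/s^2 */
    for (size_t i = 0; i < n; ++i) {   /* on a GPU: one work-item per box */
        boxes[i].vy += g * dt;
        boxes[i].x  += boxes[i].vx * dt;
        boxes[i].y  += boxes[i].vy * dt;
        boxes[i].z  += boxes[i].vz * dt;
        if (boxes[i].y < 0.0f) {       /* crude floor bounce: action = -reaction */
            boxes[i].y  = 0.0f;
            boxes[i].vy = -0.5f * boxes[i].vy;
        }
    }
}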

The picture is from Crysis 3, which does not use OpenCL as far as we know.

Continue reading “OpenCL – the battle, part III”

OpenCL in the cloud – API beta launching in a month

We’re starting the beta phase of our AMD FirePro based OpenCL cloud service in about a month, to test our API. If you need to have your OpenCL-based service online and don’t want to pay hundreds to thousands of euros for GPU hosting, then this is what you need. We have room for a few others.

The instances are chrooted, not virtualised. The API calls are protected, and potentially some extra calls have to be made to fully lock the GPU to your service. The connection is 100 Mbit duplex.

Payment is per usage: per second, per GPU and per MB of data – we will fine-tune the weights together with our first customers. The costs are capped, to make sure our service remains cheaper than comparable EC2 instances.

Get in contact today, if you are interested.

ZiiLabs Tablet

[infobox type=”information”]

Need a ZiiLabs ZMS-40 programmer? Hire us!

[/infobox]

Intel has bought ZiiLabs, but you can still order the ZMS-40.

ZiiLabs has an early access program for OpenCL on their StemCell processor, the 100-Core ZMS-40. It could do more than 20 GFLOPS/Watt, but no official numbers have been released.

It consists of:

  • ZMS-40 powered tablet
  • OpenCL compiler (no information if it is cross or native)
  • Code samples

Read more about their program at http://www.ziilabs.com/products/software/opencl.php. Also check the information on the ZMS-40 to see what the processor is capable of. Here are a few characteristics:

  • Quad 1.5 GHz ARM Cortex-A9 MP Cores
  • 96x fully-programmable StemCell Media Processing cores
  • 58 GFlops StemCell compute power

Install (Intel) Altera Quartus 16.0.2 OpenCL on Ubuntu 14.04 Linux

To temporarily increase capacity we put Quartus 16.0.2 on an Ubuntu server, which did not go smoothly – but at least more smoothly than upgrading packages to the required versions on RedHat/CentOS. While the download says “Linux” and you would expect support for multiple Linux breeds, there is only official support for RedHat 6.5 (and CentOS).

Luckily it was very possible to get a stable installation of Quartus on Ubuntu. As the information on this subject was scattered around the net and often incomplete, we decided to share our howto in this blog post. These tips probably also work for other modern Linux-based operating systems like Fedora, SUSE, Arch, etc, as most problems are due to newer features and more up-to-date libraries than are provided in RedHat/CentOS.

Note 1: we did not install the FPGA on the Ubuntu machine, nor did we fully research the potential problems of doing so – installing the FPGA on an Ubuntu machine is at your own risk. Have your board maker follow this tutorial to test their libraries on Ubuntu.

Note 2: we tested on Ubuntu 14.04. No guarantees that it all works on other versions. Let us know in the comments if it works on other versions too. Continue reading “Install (Intel) Altera Quartus 16.0.2 OpenCL on Ubuntu 14.04 Linux”

The 13 application areas where OpenCL and CUDA can be used

Did you find your specialism in the list? The formula is the easiest introduction to GPGPU I could think of, including the need for auto-tuning.

Which algorithms map best to which accelerator? In other words: what kinds of algorithms are faster when using accelerators and OpenCL/CUDA?

Professor Wu Feng and his group from Virginia Tech took a close look at which types of algorithms were a good fit for vector processors. This resulted in a document: “The 13 (computational) dwarves of OpenCL” (2011). It became an important document here at StreamHPC, as it gave a good starting point for investigating new problem spaces.

The document is inspired by Phil Colella, who identified seven numerical methods that are important for science and engineering. He called these algorithmic methods “dwarves”. With 6 more application areas in which GPUs and other vector-accelerated processors do well, the list was completed.

As a funny side-note, in Brothers Grimm’s “Snow White” there were 7 dwarves and in Tolkien’s “The Hobbit” there were 13.

Continue reading “The 13 application areas where OpenCL and CUDA can be used”

Call for speakers: IEEE eScience Conference in Amsterdam

We’re on the program committee of the 14th IEEE eScience Conference in Amsterdam, organized by the Netherlands eScience Center. It will be held from 29 October to 1 November 2018, and the deadline for submitting abstracts is Monday 18 June.

The conference brings together leading international researchers and research software engineers from all disciplines to present and discuss how digital technology impacts scientific practice. eScience promotes innovation in collaborative, computationally- or data-intensive research across all disciplines, throughout the research lifecycle.

Continue reading “Call for speakers: IEEE eScience Conference in Amsterdam”

OpenCL under Wine

The Wine 1.3 branch has had support for OpenCL 1.0 since 1.3.9. Since Microsoft likes to get a little part of the Linux-dominated HPC market, support for GPGPU is pretty good under the $799.00 Visual Studio – the free Express version is not supported well. But why not take the produced software back via Wine? The problem is that OpenCL is not in the current Wine binaries for some reason, but that is fixable while we wait for inclusion…

Lazy or short on time? You can try my binaries (Ubuntu 32-bit, NVIDIA), but I cannot guarantee they work for you and using them is at your own risk: download (reported not working by some). See the second part of step 3 for what to do with them.

All the steps

I assume you have the OpenCL-SDK installed, but let me know if I need to add more details or clear up some steps.

1 – get the sources

The sources are available here. Be sure you download at least version 1.3.9. Alternatively, download the latest from git by going to a directory and executing:

git clone git://source.winehq.org/git/wine.git

A directory “wine” will be created. That was easy, so let’s go bake some binaries.

Continue reading “OpenCL under Wine”

Installing both NVidia GTX and AMD Radeon on Linux for OpenCL

August 2012: article has been completely rewritten and updated. For driver-specific issues, please refer to this article.

Want to have both your GTX and Radeon working as OpenCL devices under Linux? The bad news is that all attempts to get the Radeon as a compute device and the GTX as primary failed. The good news is that the other way around works pretty easily (with some luck). You need to install both drivers and watch out that libglx.so isn’t overwritten by NVidia’s driver, as we won’t use that GPU for graphics – this is also the reason why it is impossible to use the second GPU for OpenGL.

Continue reading “Installing both NVidia GTX and AMD Radeon on Linux for OpenCL”

NVIDIA’s answer to SandyBridge and Fusion

Intel has Sandy Bridge, AMD has Fusion, and now NVIDIA has a combination of CPU and GPU too: Project Denver. The only difference is that it is not X86-based but an ARM architecture – and most probably the most powerful ARM GPU of 2011.

For years there have been ARM-based systems-on-a-chip: a CPU and a GPU combined (see the list below). On the X86 platform the “integrated GPU” was on the motherboard, and since this year both AMD/ATI and Intel have hit this “new market”. The big advantage is that it is cheaper to produce, more powerful per Watt (in total) and has good acceleration potential. NVIDIA does not have X86 chips and would have been the big loser of 2011; they did everything to reinvent themselves: 3D was reintroduced, CUDA was actively developed and pushed (free libraries and tools, university programs, many books and trainings, Tesla, etc), a mobile Tegra graphics solution [1] (see image at the right), and all existing products got extra backing from the marketing department. A great time for researchers who needed free products in exchange for naming NVIDIA in their research reports.

NVIDIA chose ARM; interesting for anyone watching the CUDA-vs-OpenCL battle, since CUDA was for NVIDIA GPUs on X86 and ARM was solely OpenCL territory. Period. In contrast to their other ARM-based chips, this new chip probably won’t be in smartphones (yet); it targets systems that need more GPU power, for CUDA and games.

In a few days the article about Windows-on-ARM is to be released, which completes this article.

Continue reading “NVIDIA’s answer to SandyBridge and Fusion”

OpenCL at SC14

During SC14 (the SuperComputing Conference 2014), OpenCL is again all over New Orleans. Just like last year, I’ve composed an overview based on info from the Khronos website and the SC2014 website.

Finally I’m attending SC14 myself, and I will give two talks. On Tuesday I’ll be part of a 90-minute Khronos session, where I’ll talk a bit about GROMACS and about selecting the right accelerator for your software. On Wednesday I’ll be sharing our experiences from our port of GROMACS to OpenCL. If you meet me, I can hand you a leaflet with the decision chart that helps select the best device for the job.

Continue reading “OpenCL at SC14”

Intel’s OpenCL SDK examples for GCC

Update august 2012: There is a new post for the latest Linux examples.

Note: these patches won’t work anymore! You can learn from the patches how to fix the latest SDK-code for GCC and Linux/OSX.

Code examples are not bundled with the Linux OpenCL SDK 1.1 beta. Intel’s focus is primarily Windows, so Visual Studio seems to be the logical target. I just prefer GCC/LLVM, which you can get to work with all OSes. After some time spent finding alternatives for the MS-specific calls, I think I managed. Since ShallowWater uses DirectX and is quite extensive, I did not create a patch for that one – sorry for that.

I had a lot of trouble getting the BMP export to work, because serialisation of the struct added an extra short. Feedback (such as a correct BMP export of a file) is very welcome, since I think the colours are correct. For the rest: most warnings are removed and it just works – tested with g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2 on 64-bit (llvm-g++-4.2 seems to work too, but not fully tested).
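For those who hit the same problem: the “extra short” is classic struct padding. The BMP file header is 14 bytes and starts with a 2-byte type field followed by a 4-byte size, so the compiler inserts 2 padding bytes unless the struct is packed – and those 2 bytes end up in the file. A sketch of the fix (the field names are the usual BMP ones, not necessarily those used in Intel’s sample code):

#include <stdint.h>

/* Without packing, sizeof(bmp_file_header_t) is 16 on most compilers:
   2 padding bytes are inserted after 'type' so that 'size' is 4-byte
   aligned. Written to disk as-is, that shifts every later field and
   breaks the BMP. */
#pragma pack(push, 1)
typedef struct {
    uint16_t type;       /* must be "BM" */
    uint32_t size;       /* file size in bytes */
    uint16_t reserved1;
    uint16_t reserved2;
    uint32_t off_bits;   /* offset to the pixel data */
} bmp_file_header_t;
#pragma pack(pop)

/* With the pragma, sizeof(bmp_file_header_t) == 14, as the format requires. */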

THE PATCHES ARE PROVIDED AS IS – NO WARRANTIES!

Continue reading “Intel’s OpenCL SDK examples for GCC”

Customer: “So you also do full projects?”

One of those moments when you find out that the company is not seen the way you want it to be seen. Compared to generic software engineering companies, we have the advantage of creating software that is capable of processing more data.

For years we have promoted this unique advantage, for which we use OpenCL, CUDA, HIP and several other languages – always having full projects in mind, but unfortunately not clearly communicating this.

During discussions with several existing and new customers, it suddenly became clear that we are seen as a company that fixes code, not one that builds the full code.

This became most clear when it was suggested that we collaborate with another party, where our role would be to make sure they would not make mistakes regarding performance and code quality.

  • Customer:   You can work with a team we hired before.
  • Us:             We also do full projects.
  • Customer:   Really?

This would mean we would be the seniors in the group, but not own the project – a suboptimal situation, as important design decisions could be ignored. Continue reading “Customer: “So you also do full projects?””

Exposing OpenCL on Android: Q&A with Tim Lewis of ZiiLabs

ZiiLabs has been offering an early access program for their OpenCL SDK since last year. This program was very selective in choosing developers, and little news has been put on their webpage. Now that they are planning to make their Android NDK a standard component, it’s a good time to ask them some questions. GPGPU consultant Liad Weinberger of Appilo also added a few questions.

The Q&A was with Tim Lewis, director of Marketing and Partner Relations at ZiiLabs, who took the time to give some insight into what we can expect around accelerated computations on Android. ZiiLabs was previously better known as 3DLabs and reinvented itself in 2009 (you can read the full history here). Like other companies in the ARM industry, they mostly design chips and let other parties manufacture devices using their schematics, drivers and software. Now to the questions.

Continue reading “Exposing OpenCL on Android: Q&A with Tim Lewis of ZiiLabs”