Software Development

You have developed software that gives the answers you need but takes too long? Or maybe you need to calculate large data-sets on an hourly base, while the batch takes 2 hours?

What do you do when faster hardware starts to get too costly in terms of maintenance costs? You can buy specialized hardware, but that increases costs and dependence on external knowledge. Or, you can choose to just wait for the results to come in, but you can only do this when the computation is not a core process.

What if you could use off-the-shelf hardware to decrease waiting-time? By using OpenCL-devices, which can be high-end graphics cards or other modern processors, software can be sped up by a factor 2 to 20. Why? Because these devices can do much more in parallel and OpenCL makes it possible to make use of that (unused) potential. A few years ago this was not possible to do in the same way it is done now; that’s probably the main reason you haven’t heard of it.

Solutions

All we offer comes into three solutions: find what is available, make a parallel version of the code, and hand-tune the code for maximum performance.

[pricing_tables]
[pricing_table column=”one_third” title=”Specialised Libraries” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

For many “good enough”
Faster code, the easy way.
Gives high performance for generic problems.

[/pricing_table]
[pricing_table column=”one_third” title=”Parallel Coding” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

Better caching can give more boost than using faster hardware
Software running in parallel is a first step to GPU-computing
Making the software modular when possible

[/pricing_table]
[pricing_table column=”one_third” title=”High Performance Coding” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

The highest performance is guaranteed
Optimized for the targeted hardware

[/pricing_table]
[/pricing_tables]

Services

There are so many possibilities to speed up code, but one is the best. To help you find the right path, we offer various services.

[pricing_tables]
[pricing_table column=”one_third” title=”Code Review” buttontext=”More info »” buttonurl=”https://streamhpc.com/consultancy/our-services/code-review/” buttoncolor=””]

Code-review of GPU-code (OpenCL, CUDA, Aparapi, and more).
Code-review of CPU-code (Java, C, C++ and more).
Report within 1 week if necessary.

[/pricing_table]
[pricing_table column=”one_third” title=”GPU Assessment” buttontext=”More info »” buttonurl=”https://streamhpc.com/consultancy/rapid-opencl-assessment/” buttoncolor=””]

Find parallellizable computations
Give the fitness to run on GPUs
Report within 2 weeks

[/pricing_table]
[pricing_table column=”one_third” title=”Architecture Assessment” buttontext=”Request more info »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

Architecture check-up
Data-transport measurements
Report within 2 weeks

[/pricing_table]
[/pricing_tables]

More information

We can make your compute-intensive algorithms much faster and scalable. How do we do it? We can explain it all to you by phone or in person. Send in the form on this page, and we will contact you.

You can also call now: +31 6454 00 456.

We invite you to download our brochures to get an overview of how we can help you widen the bottlenecks in your software.

StreamHPC is at HiPEAC, Berlin

Posted by Vincent Hindriksen on 9 January 2013

HiPEAC is the conference on “High-performance and Embedded Architectures and Compilers” to be held on 20 to 22 January in Berlin – tickets are still available, if you can spare three days. I (Vincent Hindriksen) would like to share the program I’m doing.

Edit: I met great people here and hope to see you next year (again).

Continue reading “StreamHPC is at HiPEAC, Berlin” →

Faster Development Cycles for FPGAs

normal-vs-opencl-fpga-flow — The time-difference between the normal and OpenCL flow is large. The final product is as fast and efficient.

VHDL and Verilog are not the right tools when it comes to developing on FPGAs fast.

It is time-consuming. If the first cycle takes 3 months, then each subsequent cycle easily takes 2 weeks. Time is money.
Porting or upgrading a design from one FPGA device to another is also time-consuming. This makes it essential to choose the final FPGA vendor and family upfront.
Dual-platform development on CPU and FPGA needs synchronisation. The code works on either the CPU or the FPGA, which makes the functional tests made for the CPU-version less trustworthy.

Here is where OpenCL comes in.

Shorter development cycles. Programming in OpenCL is normally much faster than in VHDL or Verilog. If you are porting C/C++ code onto FPGA the development cycles will be dramatically shorter. Think weeks instead of months – as this news article explains. This means a radically reduced investment as well as providing time for architectural exploration.
OpenCL works on both CPUs and FPGAs, so functional tests can be run on either. As a bonus the code can be optimised for GPUs, within a short time-frame.
The performance is equal to VHDL and Verilog, unless FPGA-specific optimisations are used, such as vector-widths not equal to a power of two.
Vendor Agnostic solution. Porting to other FPGAs takes considerably less time and the compiler solves this problem for you.
Both Xilinx and Altera have OpenCL compilers. Altera were the first to come out with an OpenCL offering and have a full SDK, which is an add-on to Quartus II. Xilinx also have a stand-alone OpenCL development environment solution called SDAccel.

Support for OpenCL is strong by both Altera and Xilinx

Both vendors suggest OpenCL to overcome existing FPGA design problems. Altera suggest to use OpenCL to speed-up the process for existing developers. So OpenCL is not a third party tool, you need to trust separately.

OpenCL allows a user to abstract away the traditional hardware FPGA development flow for a much faster and higher level software development flow – Altera

Xilinx suggests that OpenCL can enable companies without the needed developer resources to start working with FPGAs.

Teams with limited or no FPGA hardware resources, however, have found the transition to FPGAs challenging due to the RTL (VHDL or Verilog) development expertise needed to take full advantage of these devices. OpenCL eases this programming burden – Xilinx

Why choose StreamHPC?

There are several reasons to choose letting us to do the porting and protoyping of your product.

We have the right background, as our team consists of CPU, GPU and FPGA developers. Our code is therefore designed with easy porting in mind.
Our costs are lower than having the product done in Verilog/VHDL.
We give guarantees and support for our products on all platforms the product is ported on.
We can port the final OpenCL code to Verilog/VHDL, keeping the same performance. In case you don’t trust a high-level language, we have you covered.
Optionally you can get both the code and a technical report with a detailed explanation of how we did it. So you can learn from this and modify the code yourself.
You get free advice on when (and not) to use OpenCL for FPGAs.

There are three ways to get in contact quickly:

call: +31 854865760 (European office hours)

e-mail: contact@streamhpc.com

Fill in this form – mention when you want to be called back (possible outside normal office hours):

[contact_form]

Want to read more?

We wrote about OpenCL-on-FPGAs on our blog in the previous years.

Why use OpenCL on FPGAs? (2014)
Starting with Altera OpenCL-on-FPGAs (2013)
OpenCL on Altera FPGAs (2012)

A short story: OpenCL at LaSEEB (Lisboa, Portugal)

Posted by Vincent Hindriksen on 19 September 2014

The research lab LaSEEB (Lisboa, Portugal) is active in the areas of Biomedical Engineering, Computational Intelligence and Evolutionary Systems. They create software using OpenCL and CUDA to speed-up their research and simulations.

They were one of the first groups to try out OpenCL, even before StreamHPC existed. To simplify the research at the lab, Nuno Fachada created cf4ocl – a C Framework for OpenCL. During an e-mail correspondence with Nuno, I asked to tell something about how OpenCL was used within their lab. He was happy to share a short story.

We started working with OpenCL since early 2009. We were working with CUDA for a few months, but because OpenCL was to be adopted by all platforms, we switched right away. At the time we used it with NVidia GPUs and PS3’s, but had many problems because of the verbosity of its C API. There were few wrappers, and those that existed weren’t exactly stable or properly documented. Adding to this, the first OpenCL implementations had some bugs, which didn’t help. Some people in the lab gave up using OpenCL because of these initial problems.

All who started early on, recognises the problems described above. Now fast-forward to now.

Nowadays its much easier to use, due to the stability of the implementations, and the several wrappers for different programming languages. For C however, I think cf4ocl is the most complete and well documented. The C people here in the lab are getting into OpenCL again due to it (the Python guys were already into it again due to the excellent PyOpenCL library). Nowadays we’re using OpenCL for AMD and NVidia GPUs and multicore CPUs, including some Xeons.

This is what I hear more often lately: the return of OpenCL to research labs and companies, after a several years of CUDA. It’s a combination of preparing for the long-term, growing interest in exploring other and/or cheaper(!) accelerators than NVIDIA’s GPUs, and OpenCL being ready for prime-time.

♦

Do you have your story of how OpenCL is used within your lab or company? Just contact us.

Are you interested in other wrappers for OpenCL, See this list on our knowledge base.

Waiting for Mobile OpenCL – Q1 2011

Posted by Vincent Hindriksen on 14 February 2011 with 3 Comments

About 5 months ago we started waiting for Mobile OpenCL. Meanwhile we had all the news around ARM on CES in January, and of course all those beta-programs made progress meanwhile. And after a year of having “support“, we actually want to see the words “SDK” and/or “driver“. So who’s leading? Ziilabs, ImTech, Vivante, Qualcomm, FreeScale or newcomer nVIDIA?

Mobile phone manufacturers could have a big problem with the low-level access to the GPU. While most software can be sandboxed in some form, OpenCL can crash the phone. But at the other side, if the program hasn’t taken down the developer’s test-phone, the chances are low it will take any other phone. And also there are more low-level access-points to the phone. So let’s check what has happened until now.

Note: this article will be updated if more news comes from MWC ’11.

OpenCL EP

For mobile devices Khronos has specified a profile, which is optimised for (ARM) phones: OpenCL Embedded Profile. Read on for the main differences (taken from a presentation by Nokia).

Main differences

Adapting code for embedded profile
Added macro __EMBEDDED_PROFILE__
CL_PLATFORM_PROFILE capabilityreturns the string EMBEDDED_PROFILE if only the embedded profile is supported
Online compiler is optional
No 64-bit integers
Reduced requirements for constant buffers, object allocation, constant argument count and local memory
Image & floating point support matches OpenGL ES 2.0 texturing
The extensions of full profile can be applied to embedded profile

Continue reading “Waiting for Mobile OpenCL – Q1 2011” →

OpenCL Books

[infobox type=”information”]

Want a book written be us?

Format: PDF
Digital: 150-200 pages
Price: TBD
Author: The StreamHPC team
OpenCL-version: 2.0

[/infobox]

Below are the books that are available as downloadable PDF or as a book. Please contact us if a printed book is missing. Note that I tend to be critical to books, and not to be overwhelming positive and talk in superlatives. Most books target C and C++ developers, so be sure you learn the basics of C or C++ first before learning OpenCL. Most important are bit-shifts, pointers and structs, but also thinking more in hardware than in getting-things-done is needed to get the full potential out of OpenCL. While you get that book on C, also get a book on computer-architecture to understand the concepts of bandwidth to its fullest. Then you are ready for one of the pearls below. Happy reading!

The books are ordered descending by published date, except the first.

“The OpenCL specifications” by the Khronos Group

2.0 – current version

Format: PDF
File Size: 2.2MB
Digital: 281 pages
Price: Free
Publisher: Khronos Group
Author: Aaftab Munshi (Editor)
Published Date: 13 July 2013 (version 2.0, revision 11)
OpenCL-version: 2.0
Homepage: http://www.khronos.org/registry/cl/ – http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf

As a specifications-document, you cannot expect a nice piece of prose, but most of the knowledge you need to know is in it. There are certainly some gaps (especially in clear explanation), but every version is getting better. When studying other sources, always have this document with you as a reference. I printed it as two pages per side (A4).

Read chapters 1 to 3, and leave the rest as a reference. Other books explain the long, long lists of language specifications in a nicer form. Also the most you’ll better learn by doing.

1.2, 1.1, 1.0 – previous versions

Format: PDF
File Size: 3.3MB
Digital: 380 pages
Price: Free
Publisher: Khronos Group
Author: Aaftab Munshi (Editor)
Published Date: 14 November 2012 (version 1.2, revision 19)
OpenCL-version: 1.2
Homepage: http://www.khronos.org/registry/cl/ –

As many software is still written in a previous version, it is good to know the differences. Differences between 1.1 and 1.2 are written here.

OpenCL Programming By Example

Two version available, the 1.0 version and the 1.2 version. To start with the 1.2:

Format: eBook/pBook
Pages: 304
Price: €4,50 (e), €50,- (p+e)
Publisher: PACKT
Authors: Ravishekhar Banger (AMD), Koushik Bhattacharyya (AMD)
Published Date: December 2013
OpenCL-version: 1.2
Homepage: http://www.packtpub.com/opencl-programming-by-example/book

“The OpenCL Programming Book” – Fixstars

Two version available, the 1.0 version and the 1.2 version. To start with the 1.2:

Format: eBook
Pages: 325
Price: USD 19.50
Publisher: Fixstars Corporation
Authors: Ryoji Tsuchiyama, Takashi Nakamura, Takuro Iizuka, Aki Asahara, Satoshi Miki, Jeongdo Son and Satoshi Miki. Satoru Tagawa (translator)
Published Date: January 2012
OpenCL-version: 1.2
Homepage: http://www.fixstars.com/en/opencl/book/

Format: eBook
Pages: 246
Price: free
Publisher: Fixstars Corporation
Authors: Ryoji Tsuchiyama, Takashi Nakamura, Takuro Iizuka, Akihiro Asahara, Satoshi Miki
Published Date: 31 March 2010
OpenCL-version: 1.0
Homepage: http://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/contents/

1.0-version: It seems to be translated from Japanese to English, but except some small typos and spelling errors the book is very easy to read. The book explains the chapters you could skip in Khronos’ specifications-document, but certainly is not complete since it discusses OpenCL 1.0 and has a focus on the basics. The parts where that build up a program step-by-step is a bit annoying to read, because they repeat the whole program again while only a few lines have changed. The book would be more like 180-200 pages if written more compact.

1.2-version: Thicker, more up to date and a promise there are less translation-errors.

Heterogeneous Computing with OpenCL, second edition

Format: pBook
Pages: 400 (approx.)
Price: USD 69.95
Publisher: Morgan Kaufmann
Authors: Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry & Dana Schaa
Published Date: Sept 2011
OpenCL-version: 1.2
Homepage: http://www.elsevierdirect.com/product.jsp?isbn=9780123877666

This is where we all chose OpenCL for: hybrid processors. And this book dives into that world completely, so we actually learn a lot new stuff about the advantages of having a GPU on your lap.

The new edition upgrades the book to version 1.2, but nothing much new has been added. So if you have the first edition, there is no need to buy the second edition.

OpenCL in Action

Format: eBook + pBook
Pages: 475
Price: USD 47.99 (e). USD 59.99 (p+e)
Publisher: Addison-Wesley Professional
Authors: Matthew Scarpino
Published Date: non-final version updated regularly, target November 2011
OpenCL-version: 1.1
Homepage: http://www.manning.com/scarpino2/

Matthew Scarpino also wrote SWT/JFace In Action and Programming the Cell-processor, has a profession in Linux and has much experience in IT. The book seems to target an audience who want a more practical guide to learn OpenCL. He runs a blog at http://www.openclblog.com/

It is currently my favourite book and this one is a must-have for everybody interested or working with OpenCL.

OpenCL Programming Guide

Format: PDF and/or print
Pages: 648
Price: USD 35.19 (e), USD 43.99 (p), USD 59.39 (p+e)
Publisher: Addison-Wesley Professional
Authors: Aaftab Munshi (Apple, Khronos Group), Benedict Gaster (AMD), Timothy G. Mattson, Dan Ginsburg
Published Date: August 2011
OpenCL-version: 1.1
Homepage: http://my.safaribooksonline.com/9780132488006 and http://www.openclprogrammingguide.com/

Aaftab Munshi is also responsible for the OpenCL-specifications, so he probably knows where he’s talking about.

The 648 pages it is quite bigger than the targeted 480. Currently this is a very good replacement for Fixstars’ book. Disadvantage is that sending the printed book overseas (not USA/Canada) is much too expensive and people from the Eurasian continent, Africa and Latin America should just print it locally – looking into that to find better options.

OpenCL Parallel Programming Development Cookbook

Format: pBook + eBook
Pages: 303 pages
Price: USD 26.39 (e) / 54.99 (p+e)
Publisher: PACT publishing
Author: Raymond Tay
Published Date: August 2013
OpenCL-version: 1.2
Homepage: http://www.packtpub.com/opencl-parallel-programming-development-cookbook/book

Introductory book for OpenCL beginners. Examples: histogram, Sobel edge detection, Matrix Multiplication, Sparse Matrix Vector Multiplication, Bitonic sort, Radix sort, n-body.

Programming Massively Parallel Processors

Format: pBook
Pages: 258 pages
Price: USD 46.40
Publisher: Morgan Kaufmann
Authors: David B. Kirk (NVIDIA) and Wen-mei W. Hwu (University of Illinois)
Published Date: 28 January 2010
OpenCL-version: 1.1?
Homepage: http://blogs.nvidia.com/ntersect/2010/01/worlds-first-textbook-on-programming-massively-parallel-processors.html

The book claims to discuss both OpenCL and CUDA, but actually there is just one chapter on OpenCL and the focus is strong towards NVIDIA hardware. It is a nice book for people who need to learn to program CUDA-only software/hardware and don’t want a book that’s too hard to understand. There are assignments at the end of each chapter and important subjects are explained in detail, so you don’t need to have a hard time with those assignments.

It is not good for people interested in OpenCL-compliant architectures from AMD, ARM and IBM besides NVIDIA’s. It is one of the best resources to understand NVIDIA architectures from a view of a GPGPU-programmer. The second edition adds more chapters on for example MPI and OpenACC. It is also less negative about OpenCL than in the first edition.

Difference between CUDA and OpenCL 2010

Posted by Vincent Hindriksen on 22 April 2010 with 11 Comments

THIS ARTICLE IS VERY OUTDATED AND NOW SIMPLY UNTRUE FOR CERTAIN PARTS! NEW ARTICLE COMING UP.

Most GPGPU-enthusiasts have heard of both OpenCL and CUDA. While there are more solutions, these have the most potential. Both techniques are very comparable like a BMW and a Mercedes, but there are some differences. Since the technologies will evolve, we’ll take a look at the differences again next year. We’ve discussed this difference in a with a focus on marketing earlier this year.

Disclaimer: we have a strong focus on OpenCL (but actually for reasons explained in this article).

Terminology

If you have seen kernels of OpenCL and CUDA, you see the biggest difference might be the prefix “cl_” or the prefix “cu_”, but there is also a difference in terminology.

Matt Harvey (developer of Cuda2OpenCL-translator Swan) has summed up the differences in a presentation “Experiences porting from CUDA to OpenCL” (PDF):

CUDA term	OpenCL term
GPU	Device
Multiprocessor	Compute Unit
Scalar core	Processing element
Global memory	Global memory
Shared (per-block) memory	Local memory
Local memory (automatic, or local)	Private memory
kernel	program
block	work-group
thread	work item

As far as I know, the kernel-program is also called a kernel in OpenCL. Personally I like Cuda’s terms “thread” and “per-block memory” more. It is very clear CUDA targets the GPU only, while in OpenCL it an be any device.

Edit 2011-01-15: In a talk by Sami Rosendahl the differences are also discussed.

Speed-comparison

We would like to present you a benchmark between OpenCL and CUDA with full comparison, but we don’t have enough hardware in-house to do a full benchmark. Below information is what we’ve found on the net and a little bit based on our own experience.

On NVidia hardware, OpenCL is up to 10% slower (see Matt Harvey’s presentation); this is mainly because OpenCL is implemented on top of CUDA-architecture (this shouldn’t be a reason, but to say NVidia has put more energy in CUDA is just a wild guess also). On ATI 4000-series OpenCL is just slow, but gives very comparable to NVidia if compared to the 5000-series. The specialised streaming processors NVidia’s Tesla and AMD’s FireStream really bite each other, while the Playstation 3 unbelievably still wins on some tasks.

The architecture AMD/ATI-hardware is very different from NVidia’s and that’s why a kernel written with a specific brand or GPU in mind just performs better than a version which is not optimised. So if you do a benchmark, it really depends on which kernels you use for it. To be more precise: any benchmark can be written in favour of a specific architecture. Fine-tuning the software to work a maximum speed in current and future(!) hardware for different kinds of datasets is (still) a specialised task for that reason. This is also one of the current problems of GPGPU, but kernel-optimisers will get better.

If you like pictures, Hugh Merz comes to the rescue, who compared CUDA-FFT against FFTW (“the fastest FFT in the West”). The page is offline now, but you it was clear that the data-transfer from and to the GPU is a huge bottleneck and Hugh Merz was rather sceptical about GPU-computing in 2007. He extended his benchmark with the PS3 and a Tesla-s1070 and now you see bigger differences. Since CPUs go multi-multi-core, you cannot tell how big this gap will be in the future; but you can tell the gap will be bigger and CPUs will more and more be programmed like GPUs (massively parallel).

What we learn from this is 1) that different devices will improve if the demands are more clear, and 2) that it will be all about specialisation, since different manufacturers will hear different demands. The latest GPUs from AMD works much better with OpenCL, the next might beat all others in a many or only specific areas in 2011 – who knows? IBM’s Cell-processor is expected to enter the ring outside the home-brew PS3 render-farms, but with what specialised product? NVidia wants to enter high in the HPC-world, and they might even win it. ARM is developing multiple-core CPUs, but will it support OpenCL for a better FLOP/Watt than competitors?

It’s all about the choices manufacturers make, which way CUDA en OpenCL will develop.

Homogeneous vs Heterogeneous

For us the most important reason to have chosen for OpenCL, even if CUDA is more mature. While CUDA only targets NVidia’s GPUs (homogeneous), OpenCL can target any digital device that has an input and an output (very heterogeneous). AMD/ATI and Intel are both on the path of making architectures that are heterogeneous; just like Systems-on-a-Chip (SoCs) based on an ARM-architecture. Watch for our upcoming article about ARM & SoCs.

While I was searching for more information about this difference, I came across a blog-item by RogueWave, which claims something different. I think they switched Intel’s architectures with NVidia’s or he knew things were going to change. In the near future could bring us an x86-chip from NVidia. This will change a lot in the field, so more about this later. They already have an ARM-chip in their Tegra mobile processor, so NVidia/CUDA still has some big bullets.

Missing language-features

Like Java and .NET are very comparable, developers from both side know very well that their favourite feature is missing at the other camp. Most time such a feature is an external library, just built in. Or is it taste? Or even a stack of soapboxes?

OpenCL has:

Task-parallel execution mode (to be used on CPUs) – not needed on NVidia’s GPUs.

CUDA has unique features too:

FFT library – so in OpenCL you need to have your own kernels for it.
~~Atomic operations – which make double-write threads easier to implement.~~
Hardware texture interpolation – OpenCL has to fall back to a larger kernel or OpenGL.
Templating – in openCL you have to create new kernels for every data-type.

In short CUDA certainly has made a lot of things just easier for the developer, but OpenCL has its potential in support for more than just GPUs. All differences are based on this difference in focus-area.

I’m pretty sure this list is not complete at all, and only explains the type of differences. So please come to the LinkedIn GPGPU Users Group to discuss this.

Last words

THIS ARTICLE IS VERY OUTDATED AND NOW SIMPLY UNTRUE FOR CERTAIN PARTS! NEW ARTICLE COMING UP.

As it is done with more shared standards, there is no win and no gain to promote it. If you promote it, a lot of companies thank you, but the Rreturn-on-Investments is lower than when you have your own standard. So OpenCL is just used-as-it-is-available, while CUDA is highly promoted; for that reason more people invest in partnerships with NVidia to use CUDA instead of non-profit organisation Khronos. And eventually CUDA-drivers can be ported to IBM’s Cell-processors or to ARM, since it is very comparable to OpenCL. It really depends on the profit NVidia will make with such deals, so who can tell what will happen.

We still think OpenCL will win eventually on consumer-markets (desktop and mobile) because of support for more devices, but CUDA will stay a big player in professional and scientific markets because of the legacy software they are currently building up and the more friendly development-support. We hope they will both exist and help each other push forward, just like OpenGL vs DirectX, nVidia vs ATI, Europe vs the USA vs Asia, etc. Time will tell what features will eventually end up in each technology.

Update August 2012: due to higher demand StreamHPC is explicitly offering CUDA to OpenCL porting.

OpenCL alternatives for CUDA Linear Algebra Libraries

Posted by Vincent Hindriksen on 16 January 2014 with 4 Comments

While CUDA has had the advantage of having many more libraries, this is no longer its main advantage if it comes to linear algebra. If one thing changed over the past year, then it is linalg library-support for OpenCL. The choices have been increased at a continuous rate, as you can see the below list.

A general remark when using these libraries. When using them you need to handle your data-transfers and correct data-format, with great care. If you don’t think it through, you won’t get the promised speed-up. If not mentioned, then free.

Subject	CUDA	OpenCL
FFT	CUFFT The NVIDIA CUDA Fast Fourier Transform library (cuFFT) provides a simple interface for computing FFTs up to 10x faster. By using hundreds of processor cores inside NVIDIA GPUs, cuFFT delivers the…	clFFT is a software library containing FFT functions written in OpenCL. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and multicore programming.
Linear Algebra	MAGMA MAGMA is a collection of next generation, GPU accelerated ,linear algebra libraries. Designed for heterogeneous GPU-based architectures. It supports interfaces to current LAPACK and BLAS standards.	clMAGMA is an OpenCL port of MAGMA for AMD GPUs. The clMAGMA library dependancies, in particular optimized GPU OpenCL BLAS and CPU optimized BLAS and LAPACK for AMD hardware, can be found in the AMD Accelerated Parallel Processing Math Libraries (APPML).
Sparse Linear Algebra	CUSP CUSP is an open source C++ library of generic parallel algorithms for sparse linear algebra and graph computations on CUDA architecture GPUs. CUSP provides a flexible, high-level interface for manipulating sparse matrices and solving sparse linear systems.	clBLAS implements the complete set of BLAS level 1, 2 & 3 routines. Please see Netlib BLAS for the list of supported routines. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and multicore programming.ViennaCL is a free open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs. The library is written in C++ and supports CUDA, OpenCL, and OpenMP. In addition to core functionality and many other features including BLAS level 1-3 support and iterative solvers, the latest release ViennaCL 1.5.0 provides many new convenience functions and support for integer vectors and matrices.VexCL is a vector expression template library for OpenCL/CUDA. It has been created for ease of GPGPU development with C++. VexCL strives to reduce amount of boilerplate code needed to develop GPGPU applications. The library provides convenient and intuitive notation for vector arithmetic, reduction, sparse matrix-vector products, etc. Multi-device and even multi-platform computations are supported.
Random number generation	cuRAND The NVIDIA CUDA Random Number Generation library (cuRAND) delivers high performance GPU-accelerated random number generation (RNG). The cuRAND library delivers high quality random numbers 8x…	The Random123 library is a collection of counter-based random number generators (CBRNGs) for CPUs (C and C++) and GPUs (CUDA and OpenCL). They are intended for use in statistical applications and Monte Carlo simulation and have passed all of the rigorous SmallCrush, Crush and BigCrush tests in the extensive TestU01 suite of statistical tests for random number generators. They are not suitable for use in cryptography or security even though they are constructed using principles drawn from cryptography.
	CUDA Math Library The CUDA Math library is an industry proven, highly accurate collection of standard mathematical functions. Available to any CUDA C or CUDA C++ application simply by adding “#include math.h” in…	Looking into the details of what the CUDA math lib exactly is.
AI	GPU AI for Board Games A technology preview with CUDA accelerated game tree search of both the pruning and backtracking styles. Games available: 3D Tic-Tac-Toe, Connect-4, Reversi, Sudoku and Go.	There are many tactics to speed up such algorithms. This CUDA-library can therefore only be used for limited cases, but nevertheless it is a very interesting research-area. Ask us for an OpenCL based backtracking and pruning tree searching, tailored for your problem.
Dense Linear Algebra	EM Photonics CULA Tools Provides accelerated implementations of the LAPACK and BLAS libraries for dense linear algebra. Contains routines for systems solvers, singular value decompositions, and eigenproblems. Also provides various solvers. Free (with limitations) and commercial.	See ViennaCL, VexCL and clBLAS above. Kudos to the CULA-team, as they were one of the first with a full GPU-accelerated linear algebra product.
Fortran	IMSL Fortran Numerical Library The IMSL Fortran Numerical Library is a comprehensive set of mathematical and statistical functions available that offloads CPU work to NVIDIA GPU hardware where the cuBLAS library is utilized. Free (with limitations) and commercial.	OpenCL-FORTRAN is not available yet. Contact us, if you have interest and wish to work with a pre-release once available.
Subject	AccelerEyes ArrayFire Comprehensive GPU function library, including functions for math, signal processing, image processing, statistics, and more. Interfaces for C, C++, Fortran, and Python. Integrates with any CUDA-program. Free (with limitations) and commercial.	ArrayFire 2.0 is also available for OpenCL. Note that currently fewer functions are supported in the OpenCL-version than are supported in CUDA-ArrayFire, so please check the OpenCL documentation for supported feature list.Free (with limitations) and commercial.
Subject	NVIDIA Performance Primitives The NVIDIA Performance Primitives library (NPP) is a collection of over 1900 image processing primitives and nearly 600 signal processing primitives that deliver 5x to 10x faster performance than…	Kudos for NVIDIA for bringing it all at one place. OpenCL-devs have to do some googling for specific algorithms.

So the gap between CUDA and OpenCL is certainly closing. CUDA provides a lot more convenience, so OpenCL-devs still have to keep reading blogs like this one to find what’s out there.

As usual, if you have additions to this list (free and commercial), please let me know in the comments below or by mail. I also have a few more additions to this list myself – depending on your feedback, I might represent the data differently.

AMD’s answer to NVIDIA TESLA K10: the FirePro S9000

Posted by Vincent Hindriksen on 13 September 2012 with 8 Comments

Recently AMD announced their new FirePro GPUs to be used in servers: the S9000 (shown at the right) and the S7000. They use passive cooling, as server-racks are actively cooled already. AMD partners for servers will have products ready Q1 2013 or even before. SuperMicro, Dell and HP will probably be one of the first.

What does this mean? We finally get a very good alternative to TESLA: servers with probably 2 (1U) or 4+ (3U) FirePro GPUs giving 6.46 to up to 12.92 TFLOPS or more theoretical extra performance on top of the available CPU. At StreamHPC we are happy with that, as AMD is a strong OpenCL-supporter and FirePro GPUs give much more performance than TESLAs. It also outperforms the unreleased Intel Xeon Phi in single precision and is close in double precision.

Edit: About the multi-GPU configuration

A multi-GPU card has various advantages as it uses less power and space, but does not compare to a single GPU. As the communication goes via the PCI-bus still, the compute-capabilities between two GPU cards and a multi-GPU card is not that different. Compute-problems are most times memory-bound and that is an important factor that GPUs outperform CPUs, as they have a very high memory bandwidth. Therefore I put a lot of weight on memory and cache available per GPU and core.

Continue reading “AMD’s answer to NVIDIA TESLA K10: the FirePro S9000” →

StreamHPC launches monthly trainings in Europe

Posted by Vincent Hindriksen on 19 November 2013

Every second Monday of the month StreamHPC offers an OpenCL training in Mathematics or Media-operations. The target is OpenCL 1.2 (or 1.1 when NVIDIA is discussed). OpenCL 2.0 trainings will start in Q2/Q3, or when on request. All trainings will be given by experienced OpenCL developers/trainers.

Trainings

Trainings take 3½ days, from basics on the first morning to the special requests on the 4th morning, either with your own laptop, or logging to a compute-server.

The Media-operations module is based on Heterogeneous Computing with OpenCL, second edition by Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry & Dana Schaa.

It covers convolution, video-processing, histogram and mixed particle simulation. Extra subjects are OpenCL-OpenGL interop and code-optimisation.

A good fit if you work with images, sound and video.

The Mathematics module is based on OpenCL in Action by Matthew Scarpino.

It covers reduction, sorting, matrix-operations and signal processing.

If you work on graphs, matrices and data-manipulation, this is for you!

Continue reading “StreamHPC launches monthly trainings in Europe” →

Engineering GPGPU into existing software

Posted by Vincent Hindriksen on 5 November 2010 with 1 Comment

At the Thalesian talk about OpenCL I gave in London it was quite hard to find a way to talk about OpenCL for a very diverse public (without falling back to listing code-samples for 50 minutes); some knew just everything about HPC and other only heard of CUDA and/or OpenCL. One of the subjects I chose to talk about was how to integrate OpenCL (or GPGPU in common) into existing software. The reason is that we all have built nice, cute little programs which were super-fast, but it’s another story when it must be integrated in some enterprise-level software.

Readiness

The most important step is making your software ready. Software engineering can be very hectic; managing this in a nice matter (i.e. PRINCE2) just doesn’t fit in a deadline-mined schedule. We all know it costs less time and money when looking at the total picture, but time is just against.

Let’s exaggerate. New ideas, new updates of algorithms, new tactics and methods arrive on the wrong moment, Murphy-wise. It has to be done yesterday, so testing is only allowed when the code will be in the production-code too. Programmers just have to understand the cost of delay, but luckily is coming to the rescue and says: “It is my responsibility”. And after a year of stress your software is the best in the company and gets labelled as “platform”; meaning that your software is chosen to include all small ideas and scripts your colleagues have come up “which are almost the same as your software does, only little different”. This will turn the platform into something unmanageable. That is a different kind of software-acceptance!

Continue reading “Engineering GPGPU into existing software” →

Assessment of existing code-base’s quality

You found the main computation takes over 90% of the processing time, or you found the framework to be slow in general. We got in contact and discussed speeding up your software, after which we shared this assessment with you. This assessment should give you insights if the software-project is ready to be shared with Stream HPC.

The larger the code-base, the more important its code-quality

The first step is to prepare code for porting or optimization. As it’s not always easy to know what to do, we’ve defined 3 levels of code quality. The higher the quality-level, the fewer obstacles for the project and the lower the costs, the less communication is required and the fewer frustrations.

Preparing a project for porting / optimizing

The sections below discuss the 3 levels where a project can be. The goal is that you do a self-assessment and write down answers for each question with details, not only provide the final answer.

The code needs to have all levels marked in red: full level 1 and the high level of level 2. It does not matter if the existing code is written in Matlab, Python, C, C++, Assembly, OpenCL, CUDA or any other language.

The action points of all levels need to be done. When a project is not ready yet, we can assist in improving code-quality, but assume a lot has to be done by your team. This will be a separate (pre-)project, and the full estimation for the porting/optimization can only be done after that.

If not possible to level up the software or no source files are available, it will be handled as a black box project or R&D project. Do know that such projects can never be done fixed priced and are always unique. The generic part of the process is described in this blog – we are experienced in doing such focused R&D projects.

Level 1: Understandability

Goal: Can the software be understood without help from the main developers?

High level

Are the algorithms explained in i.e. scientific papers?
Do alternate implementations exist? I.e. in Python or Matlab.
Optional: is there a presentation/overview on the algorithm?

Mid level

Is there a software design document?
Is test-data provided that can be used to run the code?

Low level

Are all functions documented?
Is it clear what each (part of the) function is doing?
Are all variables documented?
Is it clear what each variable means?

Action points

A good communication-plan to get questions answered quickly. This includes a direct contact and regular calls.
Walk the code and improve/update the existing in-code documentation. Often it’s not looked at since it was written.

Level 2: Testability

Goal: Are there few bugs and can new bugs be easily detected?

High level

Is there a golden standard of the standard, such like a proof-of-concept or the existing code? Is it available to us?
Are the outputs deterministic? Or can they be made deterministic quickly?
Is there a good understanding where the algorithm is less stable to unstable, and can this be explained?
Is there clarity on the required maximum quantitative errors that would define the correctness of output? Are there high level test-cases for all of these?
Is there clarity on required maximum qualitative errors that would define the correctness of output? Is the collection of examples and counter-examples large enough to define the error in full?
Can the whole library tested using the sample input and output?
Is there clarity on the required precision? When is an answer correct?

Mid Level

Does the CI contain static code analysis?
Are there test-cases or automated tools for finding qualitative errors?

Low level

Are the compute-intensive functions covered by functional tests?
Are the most important functions covered by functional tests?
Are the other functions covered by functional tests?

Action points

Making the code deterministic by temporary removing random number generator.
Define what is correct output in detail.
Complete the test cases for the different types of errors.
Creating several sets of data-in and its correct data-out. This will be used for acceptance of ported code, giving the maximum errors allowed.
Decide which sub-results need to be compared.

Level 3: Quality

Goal: Is the software easy to maintain and extend?

High level

Is there build-automaton, like Cmake or Meson?
Is the code multi-OS?
Are tests run with every commit, or at least daily/nightly?

Mid level

Are functions defined to only do one thing and do it well?
Nobody of the team labelled the code as “Spaghetti code”?
Are function names self-explanatory?
Are variable-names self-explanatory?

Low level

There is no duplicated code?
There is no code that actually can be made much less complex?
There are no functions or methods longer than 100 lines excluding documentation?
There are no large classes?
There are no functions or methods with far more than 7 parameters?
There is only few commented out code?
There is only few dead code? This is code that is not being called from anywhere.
There is no code that should be rewritten?
There is no variables that are being reused?
There is no significant dependence on global state?

Action points

Prepare the (new) GPU-code for continuous integration.
Document how the ported code maps to the existing code.

Costs Assessment

Code at level 0 can take 10 – 20 times more time than code at level 1. How much exactly is difficult to say, and that’s exactly the reason we require a minimal level. The bitter pill is that the costs of not cleaning up the code is often even higher due to increased hardware costs, increased maintenance costs and lack of innovation.

There is only one reason to keep code quality under level 1: when it’s going to be replaced within a month.

When level 1 is mostly done, each missing item can take 20% to 40% of extra costs. A good example is not having good test-cases or no CPU-code available or not even an executable. This adds costs in both the implementation phase and the acceptance phase. Making a new CPU-implementation first is often cheaper.

From the described minimal level (level 1 in full, level 2 the high level) costs are more predictable and less costly. When getting a quote, these can be requested to be mentioned and you can choose to do it yourself or let us do it.

Contact

Just discuss your goals, after done an assessment. We can guide you in prioritization of making your code ready for porting.

Email to info@streamhpc.com or look to the other contact options on the contact page here. Then we’ll schedule a call and discuss what you need.

Copyright

If you find this self-assessment useful, know it took us a long time to improve this document and a lot of experience is hidden in it. But as we find quality software very important, we’re releasing this list under CC-BY-NC-ND: it cannot be altered or used commercially, and it must have a clear reference to Stream HPC as the authors. In other words: it must be clear that you did not do the research or writing of this assessment yourself.

If you have feedback or suggestions, we’d really like to hear from you!

Caffe and Torch7 ported to AMD GPUs, MXnet WIP

Posted by Vincent Hindriksen on 15 May 2017

Last week AMD released ports of Caffe, Torch and (work-in-progress) MXnet, so these frameworks now work on AMD GPUs. With the Radeon MI6, MI8 MI25 (25 TFLOPS half precision) to be released soonish, it’s ofcourse simply needed to have software run on these high end GPUs.

The ports have been announced in December. You see the MI25 is about 1.45x faster then the Titan XP. With the release of three frameworks, current GPUs can now be benchmarked and compared.

Especially the expected good performance/price ratio will make this very interesting, especially on large installations. Another slide discussed which frameworks will be ported: Caffe, TensorFlow, Torch7, MxNet, CNTK, Chainer and Theano.

This leaves HIP-ports of TensorFlow, CNTK, Chainer and Theano still be released. Continue reading “Caffe and Torch7 ported to AMD GPUs, MXnet WIP” →

Help us find our future COO

Posted by Vincent Hindriksen on 20 July 2018

*Is this a motto that goes with your personality? Then we want to talk with you.*

About 7 years ago we were still dealing with the usual peaks and lows of consultancy.

I’d like to get your help to find our future COO to help streamline this growth.

You might have seen that there are hardly any new blog posts – now you know why. By helping us find that special person, there can be put more time of writing new blog posts again.

If you know the perfect person for this job in Amsterdam, please let them know there is this unique company looking for her or him. Sharing this blog-post would help a lot.

You can find more information in this job-post:

We all know that quality comes with attention to detail, but also that with growth the details are the first to be postponed. We seek help in handling daily operations during our growth. The most important tasks are:

Customer contact. You make sure the communication is regular and smooth with all our customers, making them more engaged and happy with us.

Sales follow up. You take over to discuss the needs of potential customers pre-sales has had contact with.

Team support. You help the development-teams to get even better by helping them to solve their daily and long-term problems.

The job is very broad, but is all around a listening ear and getting things done.

You have studied business administration or alike, and have a can-do attitude. You know how to work with technical people and are a real team-player. You understand how to develop and engage group dynamics.

Do you think this is a job written for you, then we would like to hear more from you! Send an email to jobs@streamhpc.com with a motivational letter and listing relevant experience.

Thanks for helping out!

If you got sent here, we hope to hear from you!

Our Team

In 2025 we celebrated the company being 10 years, and brought everybody to Amsterdam. The photo below was taken on the boat-trip.

StreamHPC is the best known company in GPU-computing (OpenCL/CUDA/HIP/SYCL). We are also active in related technologies like Cloud-computing, embedded development, algorithm-design, graphics development (OpenGL/VULKAN), Machine Learning, and HPC (OpenMP/MPI).

We are distributed between mainly Amsterdam and Budapest.

The developers, the heart of the company

The company consists of highly skilled developers and low-level performance engineers. We mostly manage ourselves, but always with support of seniors and management. This way we have influence by showing ownership.

Each employee regularly shares their experience and checks the work of colleagues, to keep the standards high. This results in faster deliveries with higher quality of code, for which we’ve been complimented often.

Want to work at StreamHPC too? Check our jobs-page.

Hire the experts

On average we have our pipeline full for 3-6 months, but always reserve time for shorter projects (maximum a month).

Call +31 854865760 or mail to info@streamhpc.com or fill in the contact form to have a chat on how we can solve your software performance problems or do your software development.

PDFs of Monday 5 September

Posted by Vincent Hindriksen on 5 September 2011

Live from le Centre Pompidou in Paris: Monday PDF-day. I have never been inside the building, but it is a large public library where people are queueing to get in – no end to the knowledge-economy in Paris. A great place to read some interesting articles on the subjects I like.

CUDA-accelerated genetic feedforward-ANN training for data mining (Catalin Patulea, Robert Peace and James Green). Since I have some background on Neural Networks, I really liked this article.

Self-proclaimed State-of-the-art in Heterogeneous Computing (Andre R. Brodtkorb a , Christopher Dyken, Trond R. Hagen, Jon M. Hjelmervik, and Olaf O. Storaasli). It is from 2010, but just got thrown on the net. I think it is a must-read on Cell, GPU and FPGA architectures, even though (as also remarked by others) Cell is not so state-of-the-art any more.

OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems (John E. Stone, David Gohara, and Guochun Shi). A basic and clear introduction to my favourite parallel programming language.

Research proposal: Heterogeneity and Reconfigurability as Key Enablers for Energy Efficient Computing. About increasing energy efficiency with GPUs and FPGAs.

Design and Performance of the OP2 Library for Unstructured Mesh Applications. CoreGRID presentation/workshop on OP2, an open-source parallel library for unstructured grid computations.

Design Exploration of Quadrature Methods in Option Pricing (Anson H. T. Tse, David Thomas, and Wayne Luk). Accelerating specific option pricing with CUDA. Conclusion: FPGA has the least Watt per FLOPS, CUDA is the fastest, and CPU is the big loser in this comparison. Must be mentioned that GPUs are easier to program than FPGAs.

Technologies for the future HPC systems. Presentation on how HPC company Bull sees the (near) future.

Accelerating Protein Sequence Search in a Heterogeneous Computing System (Shucai Xiao, Heshan Lin, and Wu-chun Feng). Accelerating the Basic Local Alignment Search Tool (BLAST) on GPUs.

PTask: Operating System Abstractions To Manage GPUs as Compute Devices (Christopher J. Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray, and Emmett Witchel). MS research on how to abstract GPUs as compute devices. Implemented on Windows 7 and Linux, but code is not available.

PhD thesis by Celina Berg: Building a Foundation for the Future of Software Practices within the Multi-Core Domain. It is about a Rupture-model described at Ch.2.2.2 (PDF-page 59). [total 205 pages].

Workload Balancing on Heterogeneous Systems: A Case Study of Sparse Grid Interpolation (Alin Murarasu, Josef Weidendorfer, and Arndt Bodes). To my opinion a very important subject as this can help automate much-needed “hardware-fitting”.

Fraunhofer: Efficient AMG on Heterogeneous Systems (Jiri Kraus and Malte Förster). AMG stands for Algebraic MultiGrid method. Paper includes OpenCL and CUDA benchmarks for NVidia hardware.

Enabling Traceability in MDE to Improve Performance of GPU Applications (Antonio Wendell de O. Rodrigues, Vincent Aranega, Anne Etien, Frédéric Guyomarc’h, Jean-Luc Dekeyser). Ongoing work on OpenCL code generation from UML (Model Driven Design). [34 pag PDF]

GPU-Accelerated DNA Distance Matrix Computation (Zhi Ying, Xinhua Lin, Simon Chong-Wee See and Minglu Li). DNA sequences distance computation: bit.ly/n8dMis [PDF] #OpenCL #GPGPU #Biology

And while browsing around for PDFs I found the following interesting links:

Say bye to Von Neumann. Or how IBM’s Cognitive Computer Works.
Workshop on HPC and Free Software. 5-7 October 2011, Ourense, Spain. Info via j.anhel@uvigo.es
Basic CUDA course, 10 October, Delft, Netherlands, €200,-.
Par4All: automatic parallelizing and optimizing compiler for C and Fortran sequential programs.
LAMA: Library for Accelerated Math Applications for C/C++.

Rebranding the company name from StreamComputing to StreamHPC

Posted by Vincent Hindriksen on 11 May 2017

Since 2010 the name StreamComputing has been used and is widely known now in the GPU-computing industry. But the name has three problems: we don’t have the .com domain, it does not directly show what we do, and the name is quite long.

Some background on the old domain name

While the initial focus was Europe, for years our projects are done for 95% for customers outside the Netherlands and over 50% outside Europe – with the .eu domain we don’t show our current international focus.

But that’s not all. The name sticks well in academics, as they’re more used to have longer names – just try to read a book on chemistry. Names I tested as alternatives were not well-received for various reasons. Just like “fast” is associated with fast food, computing is not directly associated with HPC. So fast computing gets simply weird. Since several customers referred to us as Stream, it made much sense to keep that part of the name.

Not a new begin, but a more focused continuation

Stream HPC defines more what we are: we build HPC software. Stream HPC combines the well-known name combined with our diverse specialization.

Programming GPUs with CUDA or OpenCL.
Scaling code to multiple CPU and GPUs
Creating AI-based software
Speeding up code and optimizing for given architectures
Code improvement
Compiler tests and compiler building (LLVM)

With the HPC-focus we were more able to improve ourselves. We have put a lot of time in professionalizing our development-environment and project-management by implementing suggestions from current customers and our friends. We were already used to work fully independent and be self-managed, but now we were able to standardize more of it.

The rebranding process

Rebranding will take some time, as our logo and name is in many places. For the legal part we will take some more time, as we don’t want to get into problems with i.e. all the NDAs. Email will keep working on both domains.

We will contact all organizations we’re a member of over the coming weeks. You can also contact us, if you read this.

StreamComputing will never really go away. The name was with us for 7 years and stands for growing with the upcoming of GPU-computing.

Speeding up your data-processing

Using unused processing power

The computer as we know it has changed a lot since the past years. For instance, we now can use the graphics card for non-graphic purposes. This has resulted in a computer with a much higher potential. Doubling of processing-speed or more is more rule than exception. Using this unused extra speed gives a huge advantage to software which makes use of it – and that explains the growing popularity.

The acceleration-technique is called OpenCL and not only works on graphics cards of AMD and NVidia, but also on the latest processors of Intel and AMD, and even processors in smartphones and tablets. Special processors such as DSPs and FPGAs will get support too. As it is an open standard the support will only grow.

Offered services

StreamHPC has been active since June 2010 as acceleration-specialist and offers the following services:

[list2]

development of extreme fast software,
design of (faster) algorithms,
accelerating existing software, and
provide training in OpenCL and acceleration-tecniques.

[/list2]

Not many companies master this specialisms and StreamHPC enjoys worldwide awareness on top of that. To provide support for large projects, collaborations with other companies have been established.

The preferred way of working is is a low hourly rate and agreed bonuses for speed-ups.

Target markets

The markets we operate in are bio-informatics, financial, audio and video, hightech R&D, energy, mobile apps and other industries who target more performance per Watt or more performance per second.

WBSO

What we offer suits WBSO-projects well (in Netherlands only). This means that a large part of the costs can be subsidised. Together we can promote new technologies in the Netherlands, as is the goals of this subsidy.

Contact

Call Vincent Hindriksen MSc at +31 6 45400456 or mail to vincent@StreamHPC.nl with all your questions, or request a free demo.

Download the brochure for more information.

Our training concepts for GPGPU

Posted by Vincent Hindriksen on 14 May 2010

It’s almost time for more nerdy stuff we have in the pipe-line, but we’ll keep for some superficial blah for a moment. We concentrate on training (and consultancy). There is a lot of discussion here about “how to design training-programs about difficult concepts for technical people”, or better: “how to learn yourself something difficult”. At the end of this blog, we’ll show you a list how to learn OpenCL yourself, but before that we want to share how we look at training you.

Disclaimer: this blog item is positive about our own training-program for obvious reasons. We are aware people don’t want (too much) spam, so we’ll keep this kind of blogs to the minimum. If you want to tell the world that your training-program is better, first mail us for our international partner-program. If you want the training, come back on 14 June or mail us.

OpenCL and CUDA are not the easiest programming languages due to incomparable concepts in software-land (You can claim Java is “slightly” different). Can the usual ways of training give you the insights and facts you need to know?

Current programs

Most training-programs are vendor-supported. People who follow us on Twitter, know we are not the best supporters of vendor locked products. So lets get a list of a typical vendor-supported training-programs, I would like to talk about:

They have to be difficult, so the student accomplishes something.
The exam are expensive to demotivate trial-on-error-students.
You get an official certificate, which guarantees a income-raise.
Books and trainings focus on facts you must learn.
It’s very clear what you must learn and what you can skip.

So in short, you chose you wanted to know the material and put a lot of effort in it. You get back more than just the knowledge.

Say you get the opposite:

They are easy to accomplish.
Exams is an assignment which you only need to finish. You can try endless times.
You don’t get a certificate, but you might get feedback and homework for self-study.
You get a list of facts you must learn; the concepts are explained to support this.
You are free to pick which subject you like.

That sucks! You cannot brag about your accomplishments and after the training you still cannot do anything with it; it will probably take years to actually finish it. So actually it’s very clear why the programs are like this, or can we learn from this opposing list? Just like with everything else, you never have to just copy what’s available but pick out the good parts.

Learning GPGPU

If you want to learn GPGPU, you have to learn (in short) shader-concepts, OpenCL, CUDA and GPU-architectures. What would be needed to learn it, according to us?

A specified list of subjects you can check when understood.
An insight story of the underlying concepts to better understand the way stream-computing works. Concepts are the base of everything, to actually make it sound simple.
Very practical know-how. Such as how to integrate stream-computing-code into your current software.
A difficult assignment that gets you in touch with everything you learned. The training gave you the instruments you need to accomplish this step.

So there’s no exam and no certificate; these are secondary reasons for finishing the course. The focus should be getting the brain wrapped around the concepts and getting experience. As the disclaimer warned you, our training-program has a high focus on getting you up-and-running in one day. And you do get a certificate after your assignment gets approved, so bragging is easy.

If you want to learn stream-computing and you won’t use our training-program, what then?

Read our blog (RSS) and follow us on Twitter.
Make yourself a list of subjects you think you have to learn. Thinking before doing helps in getting a focus.
Buy a book. There are many.
Play around with existing examples. Try to break it. Example: what happens if the kernel uses more and more local/private memory.
Update the list of subjects; the more extensive, the better. Prioritise.
Find yourself an assignment. For example: try to compress or decompress a large JPG using OpenCL. If you succeed, get yourself a harder assignment. Do you want to be good or the best?

If you know OpenCL, CUDA is easy to learn! We will have some blogs which support your quest on learning OpenCL, so just start to dig in today and see you next time.

The history of the PC from 2000 – 2012

Posted by Vincent Hindriksen on 6 May 2011 with 1 Comment

After IBM-compatible clones took over from Apple, Atari and ZX Spectrum, we just got used to that a PC is an X86 with MS Windows and Office on it. Around a decade ago Apple fought back with OSX on which Windows 7 (launched in 2009) was the first real answer. Meanwhile Apple switched to Intel, since IBM was not fast enough with the development of the POWER-processor – a huge operation, which seemed a one-time-only step for Apple at the time. SemiAccurate now speaks of Intel being replaced by ARM on Apple’s laptops.

A few weeks ago I asked Computer Science students if they knew ARM. Not even 1% had heard of it, but lots more knew there was a Samsung-chip in their smartphone. So what’s going on without us knowing it?

I’ll try to describe the market for a few key-years and then try to put the big names in it. There is a lot going on between i.e. Nvidia, Samsung, Texas Instruments and Imagination Technologies in the ARM-market, but I’ll leave that out of the story. Also not mentioned are the game-consoles and servers, but they did have big influences on the home-PC market.

In the picture at the right you see an idea of how fast the markets would have grown from a 2006 perspective. (Click on it for the full report). You see that the explosive growth of smartphones was not expected; the other detail is that the cloud also was not foreseen here.

After reading you understand why Nvidia focuses so much on HPC and mobile.

Continue reading “The history of the PC from 2000 – 2012” →

European HPC Magazines

Posted by Vincent Hindriksen on 10 January 2014

If one thing can be said about Europe, is that it is quite diverse. Each country solves or fails to solve its own problems individually, while European goals are not always well-cared for. Nevertheless, you can notice things changing. One of the areas where things have changed, is that of HPC. HPH has always been a well-interconnected research in Europe (with its centre in CERN), but there is a catch-up going on for the European commercial market. The whole of Europe has new goals set for better collaboration between companies and research institutes with programs like Horizon 2020. This means that it becomes necessary to improve interconnections among much larger groups.

In most magazines HPC is a section of a broader scope. This is also very important as this introduces HPC to more people. Now, I’d like to concentrate on the focus magazines. There are mainly two magazines available: Primeur Magazine and HPC Today.

Primeur Magazine

De Netherlands based magazine Primeur-magazine has been around for years, with HPC-news from Europe, video-channel, knowledge-base, calendar and more. Issues of past weeks can be read online (for free), but news can also be delivered via a weekly e-mail (paid service, prices range from €125 to €4000 per company/institute, depending on size).

They focus on being a news-channel for what is going on in in the HPC-world, both in the EU and the US. Don’t forget to follow them on Twitter.

HPC Today

Update: the magazine changed its name from HPC Magazine to HPC Today.

With several editions (Americas, Europe and France), websites and TV channels, the France based HPC Today brings an actionable coverage of the HPC and Big Data news, technologies, uses and research. Subscriptions are free, as the magazine is paid-for by advertisements. They balance their articles by targeting both people who deeply understand malloc() and people who want to know what is going on. Their readers are developers and researchers from both academic and private sectors.

With the change to HPC Today the content has slightly changed according the requests from the readers: less science, more HPC news. For the rest it’s about the same.

To get an idea of how they’re doing, check the partners of HPC Magazine: Teratec, ISC events and SC conference.

Other European HPC sources

Not all information around the web is nicely bundled in a PDF. Find a small list below to help you start.

InSiDE

The German National Supercomputing Centers HLRS, LRZ, NIC publish the online magazine InSiDE (Innovatives Supercomputing in Deutschland) twice a year. The articles are available in html and PDF. It gives a good overview of what is going on in Germany and Europe. There are no ways to subscribe via e-mail, so it would be better to put it in your calendar.

e-IRG

The e-Infrastructure initiative‘s main goal is to support the creation of a political, technological and administrative framework for an easy and cost-effective shared use of distributed electronic resources across Europe.

e-IRG is not a magazine, but it is a good start to find information about HPC in Europe. Their knowledge-base is very useful when trying to get an overview what is there in Europe: Projects, Country-statistics, Computer centers and more. They closely collaborate with Primeur-magazine, so can you see some overlap in the information.

PRACE Digest

PRACE (Partnership for Advanced Computing in Europe) is to enable high impact scientific discovery, as well as engineering research and development across all disciplines to enhance European competitiveness for the benefit of society. PRACE seeks to achieve this mission by offering world class computing and data management resources and services through a peer review process.

The PRACE digest appears as twice a year as a PDF.

More?

Did we miss an important news-source or magazine? Let us know in the comments below!