Valgrind suppression file for AMD64 on Linux

Valgrind is a great tool for finding possible memory leaks in code written in C, C++, Java, Perl, Python, assembly code, Fortran, Ada, etc. I use it to check whether provided code is OK before I start porting it to GPU-code – it finds those devils in the details. It has also given me good feedback when writing my own OpenCL-code. Unfortunately it does not work well with optimised libraries, such as the OpenCL-driver from AMD.

You'll get reports like the one below, which clutter the output.

==21436== Conditional jump or move depends on uninitialised value(s)
==21436==    at 0x6993DF2: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x6C00F92: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x6BF76E5: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x6C048EA: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x6BED941: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x69550D3: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x69A6AA2: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x69A6AEE: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x69A9D07: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x68C5A53: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x68C8D41: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x68C8FB5: ??? (in /usr/lib/fglrx/libamdocl64.so)

How to fix this cluttering?
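The fix is a suppression file. As a minimal sketch (the rule name is arbitrary, and the exact error kinds and frames depend on your driver version), an entry that hides reports like the one above could look like this:

{
   amdocl64_uninitialised_cond
   Memcheck:Cond
   obj:/usr/lib/fglrx/libamdocl64.so
}

Valgrind can even write these entries for you: run it once with --gen-suppressions=all, paste the generated blocks into a file (say, libamdocl64.supp), and from then on start it as valgrind --suppressions=libamdocl64.supp ./your-program.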

AMD OpenCL Programming Guide August 2013 is out!

AMD has just released an update to their AMD programming guide.

Download the guide (PDF) August version

Download the guide (PDF) November version

Download TOC (PDF)

For more optimisation guides, see the tutorials page of the knowledge base.

Table of Contents

Chapter 1 OpenCL Architecture and AMD Accelerated Parallel Processing

1.1 Software Overview
1.1.1 Synchronization

1.2 Hardware Overview for Southern Islands Devices

1.3 Hardware Overview for Evergreen and Northern Islands Devices

1.4 The AMD Accelerated Parallel Processing Implementation of OpenCL

Intel CPUs, GPUs & Xeon Phi

Need a XeonPhi or Intel OpenCL programmer? Hire us!

Intel has OpenCL support for all their recent CPUs which have SSE 4.x and AVX. Since Sandy Bridge the CPUs tend to have good performance. On Ivy Bridge and later there is also support for the embedded GPU (Windows-only). Xeon Phi has support for OpenCL too, even though Intel mostly promotes OpenMP for it.

SDK

Intel does not provide a standard SDK containing both hardware and software, as their hardware is broadly available.

The driver can be downloaded from the Intel OpenCL page – select your OS at the upper-right and click ‘Download’.

On Windows the samples are included with the driver; they can also be downloaded separately here. On Linux you can download the samples, which have been ported to GCC, from our blog – there you can also read how to install the SDK.

Tools

There are various developer tools available. You can find them here:

  • Offline compiler (stand-alone (Windows+Linux) and VisualStudio-plugin)
  • OpenCL – Debugger (VisualStudio only)
  • Integration with Graphics Performance Analyzers (Windows download)
  • VTune Amplifier XE for code-optimisation (more info here, starting at $899 for both Windows and Linux)

Supported hardware

In short: all Ivy Bridge and Sandy Bridge processors.


Currently the HD4000 is the only embedded GPU that can do OpenCL, and it is only supported via Windows drivers.

Xeon Phi

Intel’s official page has more info on the processor-card, and here you’ll find the most recent (public) info.

Non-production version of Xeon Phi with half the memory-banks visible around the (large) die.

CPUs and GPUs

With Xeons of 12 to 16 cores and AVX2 (256-bit wide vectors), OpenCL works very well on CPUs.

For GPU bug-reports go to this forum.

Learning material

See this blog post for information on where to find all drivers and samples.

To optimise OpenCL for Intel processors, you can go through their very nice Optimization Guide. There is also a nice overview of tips & tricks in this article. The Intel OpenCL forums are a very good source of information as well.

The single-core, multi-core and many-core CPU

Multi-core CPU from 2011

CPUs are now split up in 3 types, depending on the number of cores: single (1), multi (2-8) and many (10+).

I find it more important now to split CPUs up into these three types, as the problems to be solved by each are very different. Based on those problem-differences, I even expect the gap in core-counts between multi-core CPUs and many-core CPUs to grow.

Below, the three types of CPUs are discussed, together with a small discussion of the many-core processors we see around.

Performance of 5 accelerators in 4 images

If there were one rule to get the best performance, it would be: avoid data-transfers. Therefore it's important to have lots of bandwidth and GFLOPS per processor, and not to simply add up those numbers. Everybody who has worked with MPI knows why: transferring data between processors can totally kill the performance. So the more is packed into one chip, the better the results.

In this short article I would like to give you a quick overview of the current state of bandwidth and performance. You might think the current generation of accelerators is very close, but actually it is not.

The devices in the images below are the AMD FirePro S9150 (16GB), NVidia Tesla K80 (1 GPU of the 2, 12GB), NVidia Tesla K40 (12GB), Intel XeonPhi 7120P (16GB) and Intel Xeon 2699 v3 (18-core CPU). I hesitated between the K40 and the K80, as I wanted to focus on single GPUs only – so I took both. Dual-GPU cards have an advantage when it comes to power-consumption and physical space – neither is taken into consideration in this blog. Efficiency (actual performance compared to the theoretical maximum) is not included either, as that also needs a broad explanation.

Each of these accelerators can be programmed with both x86 OpenMP and OpenCL.

The numbers

The bandwidth and performance numbers show where things stand: the XeonPhi and FirePro have the most bandwidth, and the FirePro is a staggering 70% to 100% faster than the rest in double precision GFLOPS.

Bandwidth per chip

The Xeon Phi gets to 350 GB/s, followed by the FirePro with 320 GB/s and the K40 with 288 GB/s. NVidia's K80 reaches only 240 GB/s, where DDR memory gets only 50-60 GB/s.

GFLOPS per chip

The FirePro leaves the competition far behind with 2530 GFLOPS (double precision). The K40 and K80 get 1430 and 1450, followed by the CPU at 1324 and the Xeon Phi at 1208. Notice these are theoretical maximums and will be lower in real-world applications.

If you have OpenCL or OpenMP code, you can optimise it for a new device in a short time. Yes, you should have written it in OpenCL or OpenMP, as otherwise the competition can easily outperform you by simply selecting a better device.

Costs

Lowest prices in the Netherlands, at the moment of writing:

  • Intel Xeon 2699 v3: € 6,560
  • Intel Xeon Phi 7120P + 16GB DDR4: € 3,350
  • NVidia Tesla K80: € 5,500 (€ 2,750 per GPU)
  • NVidia Tesla K40: € 4,070
  • AMD FirePro S9150: € 3,500

For some devices (like the K40) a single shop has a low price, while the others are at least €200 more expensive.

Note: as the Xeon can have 1TB of memory, the "costs per GB/s" is only half the story. Currently the accelerators have only 16GB. Soon a 32GB FirePro, the S9170, will be available in the shops, to move up in this space of memory-hungry HPC applications.

Costs per GB/s

Where the four accelerators are at around €11 per GB/s, the Xeon takes €131 (see the note above). Note that the K40, at €14.13 per GB/s, is expensive compared to the other accelerators.

Costs per GFLOPS per chip

For raw GFLOPS the FirePro is the cheapest, followed by the K80, XeonPhi and then the K40. While the XeonPhi and K40 are about twice as expensive per GFLOPS as the FirePro, the Xeon is clearly the most expensive at 3.5 times the FirePro's price per GFLOPS.

If costs are an issue, then it really makes sense to invest some time in making your own spreadsheets for several devices, and to include the costs of power usage.
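As a sketch of such a comparison (using the prices and theoretical numbers listed above – swap in your own, and extend with power costs), in C++:

#include <cstdio>

// Price/performance comparison with the numbers from this article.
// Power usage is left out for brevity; add price-per-kWh * watts * lifetime.
struct Device { const char* name; double price_eur, gbps, gflops; };

int main() {
    const Device devices[] = {
        {"AMD FirePro S9150",        3500, 320, 2530},
        {"NVidia Tesla K80 (1 GPU)", 2750, 240, 1450},
        {"NVidia Tesla K40",         4070, 288, 1430},
        {"Intel Xeon Phi 7120P",     3350, 350, 1208},
        {"Intel Xeon 2699 v3",       6560,  50, 1324},   // DDR bandwidth
    };
    for (const Device& d : devices)
        std::printf("%-26s %7.2f EUR per GB/s, %5.2f EUR per GFLOPS\n",
                    d.name, d.price_eur / d.gbps, d.price_eur / d.gflops);
}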

Which to choose?

Based on the above numbers, the FirePro is the best choice. But your algorithm might simply work better on one of the others – we can help you by optimising your code and performing meaningful benchmarks.

OpenCL potentials: Watermarked media for content-protection

HTML5 is the future, now that Flash and Silverlight are leaving the market to make way for HTML5-video. There is one big problem: it is hard to protect the content – before you know it, the movie is freely circulating. DRM is only a temporary solution and often ends in frustration for users, who just want to watch the movie wherever they want.

If you look at e-books, you see a much better way of making sure PDFs don't spread all over the web: personalisation. This could be done with images and videos too. The example here at the right has a very obvious, clearly visible watermark (source), but there are many methods which are not easy to see – and thus easy to miss for people who want to clean the file. That is a clear advantage over DRM, where it is obvious what has to be removed. Watermarks give buyers freedom of use. The only disadvantage is that ownership of a personalised video cannot be transferred.
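As an illustration of how cheap such personalisation is on a GPU, below is a minimal OpenCL sketch (the scheme is illustrative, not a production method) that hides a per-buyer identifier in the least significant bits of an image – real schemes would spread the mark less visibly, for example through the frequency domain:

// Embed one bit of a 32-bit buyer-id into the least significant
// bit of each pixel's red channel (8-bit RGBA image, flattened).
__kernel void embed_id(__global uchar4* pixels, uint buyer_id) {
    size_t i = get_global_id(0);
    uint bit = (buyer_id >> (i % 32)) & 1u;     // which bit to hide here
    pixels[i].x = (pixels[i].x & 0xFE) | bit;   // overwrite the LSB
}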


Reducing downtime with OpenCL… Ever thought of that?

Something that creates extra value for OpenCL is the flexibility with which it runs on a wide variety of hardware. A well-known strategy is running the code on CPUs to find data-races and debug the code more easily. Another is to develop on GPUs and port to FPGAs to reduce the development-cycles.

But there's one, quite important, often forgotten reason: replacement of faulty hardware. You can blame the supplier, or even Murphy if you want, but what is almost certain is that there's a high chance of facing downtime precisely when the hardware cannot be replaced right away.

Failing to plan is planning to fail

To limit downtime, there are a few options:

  • Have a good SLA in place for 24/7 hardware-replacement.
  • Have spare-hardware in stock.
  • Have over-capacity on your compute-servers.

But the problem is that all three are expensive in some form if you're not flexible enough. If you use professional accelerators like Intel XeonPhi, NVidia Tesla or AMD FirePro, you risk unexpected stock shortages at your supplier.

With OpenCL the hardware can be replaced by any accelerator, whereas with vendor-specific solutions this is not possible.

Flexibility by OpenCL

I'd like to share one example of how to introduce flexibility into your hardware-management; there are various others, more tailored to your requirements.

To detect faulty hardware, think of a server with three GPUs where selected jobs are run by all three – any hardware-problem will be detected and pin-pointed. Administrating which hardware has done which job completes the mechanism. Exactly this can be used to replace faulty hardware with any accelerator: let the replacement-accelerator run the same jobs as the other two, as an acceptance-test (see the sketch below).
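A minimal sketch of that acceptance-test in OpenCL host code (the trivial kernel stands in for a real job; error handling is stripped for brevity):

// Run the same job on every OpenCL device found and compare the results,
// so a faulty (or newly placed) accelerator is detected automatically.
#include <CL/cl.h>
#include <cstdio>
#include <cstring>
#include <vector>

static const char* src =
    "__kernel void job(__global float* out) {"
    "  size_t i = get_global_id(0);"
    "  out[i] = (float)i * 0.5f;"
    "}";

int main() {
    const size_t n = 1024;
    std::vector<std::vector<float>> results;

    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, NULL, &num_platforms);
    std::vector<cl_platform_id> platforms(num_platforms);
    clGetPlatformIDs(num_platforms, platforms.data(), NULL);

    for (cl_platform_id p : platforms) {
        cl_uint num_devices = 0;
        if (clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices) != CL_SUCCESS)
            continue;
        std::vector<cl_device_id> devices(num_devices);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, num_devices, devices.data(), NULL);

        for (cl_device_id d : devices) {
            cl_context ctx = clCreateContext(NULL, 1, &d, NULL, NULL, NULL);
            cl_command_queue q = clCreateCommandQueue(ctx, d, 0, NULL);
            cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
            clBuildProgram(prog, 1, &d, "", NULL, NULL);
            cl_kernel k = clCreateKernel(prog, "job", NULL);
            cl_mem buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);
            clSetKernelArg(k, 0, sizeof(buf), &buf);
            clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
            std::vector<float> out(n);
            clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), out.data(), 0, NULL, NULL);
            results.push_back(out);
            clReleaseMemObject(buf); clReleaseKernel(k); clReleaseProgram(prog);
            clReleaseCommandQueue(q); clReleaseContext(ctx);
        }
    }
    // Any device whose output differs from device 0 needs a closer look.
    for (size_t i = 1; i < results.size(); ++i)
        if (std::memcmp(results[0].data(), results[i].data(), n * sizeof(float)) != 0)
            std::printf("Device %zu disagrees with device 0!\n", i);
    std::printf("Checked %zu devices.\n", results.size());
}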

If you need your software to be optimised for several accelerators, you're in the right place. We can help you with both machine- and hand-optimisations. That's a plan that cannot fail!

CPU Code modernisation – our hidden expertise

You’ve seen the speedups possible on GPUs. We secretly know that many of these techniques would also work on modern multi-core CPUs. If after the first optimisations the GPU still gets an 8x speedup, the GPU is the obvious choice. When it’s 2x, would the better choice be a bigger CPU or a bigger GPU? Currently the GPU is chosen more often.

Now that AMD and Intel have 28+ core CPUs, the answer to that question might lean towards the CPU. With a CPU that has 32 cores and 256-bit vector-computations via AVX2, 32 double4s can be computed each clock-cycle. A 16-core CPU with 128-bit SSE could work on 16 double2s, which is only a fourth of that performance. Actual performance compared to peak-performance is comparable to GPUs here.
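To actually feed those vector units from OpenCL, kernels can use explicit vector types, which the CPU drivers map onto SSE/AVX registers – a minimal sketch:

// One double4 maps onto a single 256-bit AVX2 register on CPUs.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void axpy(__global const double4* x, __global double4* y, double a) {
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];   // a candidate for a fused multiply-add
}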

New training dates for OpenCL on CPUs and GPUs!

OpenCL remains a popular programming language for accelerators, from embedded to HPC. Good examples are consumer software and embedded devices. With Vulkan potentially getting OpenCL-support in the future, the number of supported devices will only increase.

For multicore-CPUs and GPUs we now have monthly training dates for the rest of the year:

The minimum number of participants is two. On request the location and date can be changed.

The first day of the training is the OpenCL Foundations training, which can be booked separately.

For more information call us at +31854865760.

“Soon we will use only one thousandth of available computer capacity”

Professor Henri Bal, who tries to wake up the Netherlands to start going big on parallel programming

At StreamHPC we mostly work for companies in the bigger countries of Europe and North America; we hardly work for companies in the Netherlands. But it seems that after 5 years of sleeping there is some movement. Below is a (translated) article with the above quote by Prof. Dr. Ir. Henri Bal, professor in the Computer Systems section of the Vrije Universiteit Amsterdam.

Lack of knowledge of parallel programming will cause a situation where only one thousandth of the capacity of computers will be used. This makes computations unnecessarily slow and inaccurate. That in turn will slow down the development of the Dutch knowledge economy.

Sequential programming, instructing computers to perform calculations in a queue, is now the standard. Computer processors, however, are much more sophisticated and able to perform thousands or even millions of computations simultaneously. But the programming of such many-cores "is still in its infancy, so industries that rely heavily on data cannot perform optimally", claims Bal.

The value of parallel programming, according to Bal, is enormous, for example in meteorology and forensics. "For weather forecasting, data from the dense network of computers needs to be processed quickly and accurately, to have a weather forecast for tomorrow and not in 48 hours," he says. "In forensics, all data should be explored as soon as possible in the first 24 hours after a crime, using pattern recognition, so that no trace is lost. The video material of 80,000 security cameras was manually searched through after the attack on the London Underground in 2005 – with parallel computing methods a computer can now do this rapidly."

If the Netherlands wants to close the gap, investments are necessary, says Bal. The focus should be on research and teaching. "Investments in research on programming new massively-parallel machines are required to gain knowledge. It must be examined how programs should be written for parallel computing methods, and to what extent parallel calculations can be performed automatically. In teaching, our future programmers also need to be prepared for the new standards of parallel programming. Only then can the Netherlands make optimal use of the available computer capacity."

I think my fellow countrymen will be surprised that they can find help just around the corner. And if they wait two more years, then 1000x speed-ups over sequential programs will indeed become possible.

Have you seen similar articles that sequential programming is slowing the knowledge economy?

Four conferences that will interest you

OpenCL Events

(if you get to Palo Alto, Manchester, Karlsruhe and Copenhagen)

We're supporters of open standards and open discussions. When those two come together, we melt. Therefore I'd like to share four hot conferences with you: IWOCL (Palo Alto, CA, USA), EMiT (Manchester, UK), ParallelCon (Karlsruhe, Germany) and the GPGPU-day 2015 (Copenhagen, Denmark).

I'll be at all of these conferences too and am happy to meet you.

This post was shared first via the newsletter. Subscribe here.


When Big Data needs OpenCL

Big Data in the previous century was the archive full of ring-binders and folders, which would grow each year at the same pace. Now the definition is that it should grow each year as much as in all years before combined.

A few months ago SunGard named 10 Big Data trends transforming financial services. I have used their list as a base for my own focus: increased computation-demands, not specific to this one market. This resulted in 7 general trends where Big Data meets/needs OpenCL.

Since the start of StreamHPC we have sought customers who could not compute through their whole data in time. Back then Big Data was still a buzzword catching on, but it best describes this core business.


What is OpenCL?

OpenCL (a trademark of Apple Inc.) is an open, royalty-free industry standard that makes much faster computations possible. The standard is controlled by the non-profit standards organisation Khronos. By using this technique and graphics cards (GPUs) or the extensions of modern processors, you can for example convert a video in 20 minutes instead of 2 hours.

Programming the GPU used to be a very difficult task, done by specialised teams and universities, but since 2010 it has been within reach of more companies.

Below is a video which explains the differences between single-core, multiple core (starting at 1:27) and OpenCL (starting at 2:32).

http://www.youtube.com/watch?v=IEWGTpsFtt8

You can read more about the engineering ins and outs of the standard at http://www.khronos.org/opencl/.

How OpenCL works

OpenCL is an extension to existing languages. It makes it possible to specify a piece of code that is executed multiple times, independently of each other. This code can run on various processors – not only the main one. There is also support for vector types (float2, short4, int8, long16, etc.), because modern processors support those.

Say, for example, you need to calculate sin(x) for a large array of one million numbers. OpenCL detects which devices could compute this for you and gives some statistics for each device. You can pick the best device, or even several devices, and send the data to the device(s). Normally you would loop over the million numbers, but now you say something like: "Get me sin(x) of each x in array A". When the computation has finished, you take the data back from the device(s) and you are done.

As the compute-devices can do more in parallel, and OpenCL is better at describing independent functions, the total execution time is much lower than with conventional methods.
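In (simplified) code, the "Get me sin(x) of each x in array A" step is a kernel of a few lines, launched over one million work-items instead of being looped (host-side setup and transfers omitted):

// Each work-item computes one element, independently of all others.
__kernel void compute_sin(__global const float* A, __global float* result) {
    size_t i = get_global_id(0);   // which of the million numbers am I?
    result[i] = sin(A[i]);
}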

5 questions on OpenCL

Q: Why is it so fast?
A: Because many hands make light work, the hundreds of little processors on a graphics card being the extra hands. But cooperation with the main processor remains important to achieve maximum output.

Q: Does it work on any type of hardware?
A: As it is an open standard, it can work on any type of hardware that targets parallel execution. This can be a CPU, GPU, DSP or FPGA.

Q: How does it compare to OpenMP/MPI?
A: Where OpenMP and MPI try to split loops over threads/servers and are CPU-oriented, OpenCL focuses on making threads data-position aware and on using all processor-capabilities. There are several efforts to combine the two worlds.

Q: Does it replace C or C++?
A: No, it is an extension which integrates well with C, C++, Python, Java and more.

Q: How stable/mature is OpenCL?
A: Currently we have reached version 1.2, which is 3 years old. OpenCL has many predecessors, though, and is therefore effectively quite a bit older than 3 years.

What more does Khronos have to offer than OpenCL and OpenGL?

The OpenCL standard is from the not-for-profit industry consortium Khronos Group. But they do a lot more, like the famous standard OpenGL for graphics. The focus of the group has always been on multimedia and on getting the fastest results out of the hardware.

Now that open source and open standards are getting more important, collaborations like the Khronos Group get more attention. At StreamHPC we are very happy with this trend, as the business models are more focused on collaboration and getting things done than on making sure the customer cannot ever leave.

Below is an overview of the most important APIs that Khronos has to offer.

OpenCL related

  • OpenCL: compute
  • WebCL: web compute
  • SPIR/SPIR-V: intermediate language for compute-kernels, like those of OpenCL and OpenGL's GLSL
  • SYCL: high-level language for OpenCL

OpenGL related

  • Vulkan: state-less graphics
  • OpenGL: graphics
  • OpenGL ES: embedded graphics
  • WebGL: web graphics
  • glTF: runtime asset format for WebGL, OpenGL ES, and OpenGL
  • OpenGL SC: Graphics for Safety Critical operations
  • EGL: interface between rendering APIs such as OpenGL ES and the underlying native platform window system, such as X.

Streaming input and output

  • OpenMAX: interface for multimedia codecs, platforms and hardware
  • StreamInput: interface for sensors
  • OpenVX: OpenCV-alternative, built for performance.
  • OpenKCam: interface for cameras and sensors

Others

One video called “OpenRoad” to show them all:

http://www.youtube.com/watch?v=ckD0op6OgMQ

Want to learn more? Feel free to ask in the comments, or check out https://www.khronos.org/

The magic of clGetKernelWorkGroupInfo

Workgroup with unknown characteristics

It's not easy to get the available private memory size – actually it's impossible to get this information directly from the device/drivers using the OpenCL API. This can only be explained after you dive deep into clGetKernelWorkGroupInfo – the function that tells you how well your kernel fits on the device. It is strange that this function is not discussed more often.

Memory sizes

CL_KERNEL_LOCAL_MEM_SIZE

Returns the amount of local memory, in bytes, being used by a kernel (per work-group). Use CL_DEVICE_LOCAL_MEM_SIZE to find out the maximum.

CL_KERNEL_PRIVATE_MEM_SIZE

Returns the minimum amount of private memory, in bytes, used by each work-item in the kernel.

Work sizes

CL_KERNEL_GLOBAL_WORK_SIZE

This answers the question “What is the maximum value for global_work_size argument that can be given to clEnqueueNDRangeKernel?”. The result is of type size_t[3].

CL_KERNEL_WORK_GROUP_SIZE

This is the same, but for the local_work_size argument. The kernel's resource requirements (register usage, etc.) are used to determine what this work-group size should be.

CL_KERNEL_COMPILE_WORK_GROUP_SIZE

If __attribute__((reqd_work_group_size(X, Y, Z))) is used, then (X, Y, Z) is returned, else (0, 0, 0).

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

It returns a performance-hint: if the work-group size is a multiple of this number, then you'll get good results. So no more remembering 32 or 64 for specific GPUs – simply kick in a call to this function.

Combined with clGetDeviceInfo's CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, you can fine-tune your workgroup-size in case you need the work-group to be as large as possible.
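A sketch of how these queries look in host code (assuming the kernel and device handles already exist):

// Query how this kernel fits on this particular device.
size_t wg_size = 0, pref_multiple = 0;
cl_ulong local_mem = 0, private_mem = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(wg_size), &wg_size, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(pref_multiple), &pref_multiple, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(local_mem), &local_mem, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                         sizeof(private_mem), &private_mem, NULL);
// The largest local size that is both allowed and a multiple of the hint:
size_t local_size = (wg_size / pref_multiple) * pref_multiple;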

Read more?

You’ll find interesting usages when specifically looking for the flags on Github or Stackoverflow.

Short list of interesting Stackoverflow discussions:

Software Performance is a Competitive Advantage

We are in the niche of GPGPU-computing, where GPUs are programmed to efficiently run scientific and large-scale simulations, AI training/inference and other mathematically compute-intensive software. As a recognized expert, we are trusted by customers, mostly from the US and Europe, to speed up their software.

Our projects range from several person-weeks to fix software performance problems, to several person-years to build extensive high performance software and libraries.

Join a growing list of companies that trust us with designing and building their core software with performance in mind.

A selection of Projects

From latest to oldest (back to 2014):

  • Speeding up a special-purpose camera on mobile phones [C++, OpenCL, Vulkan]. Increasing the frame rate from stuttering frames to a responsive video-stream on a smartphone made it possible to use the camera in new application areas.
  • Speeding up Generative AI software on MacOS [Objective C++, Metal]. Using Mac Studios with M1 and M2 chips, we reached theoretical max performance for offline Generative AI (making nice pictures).
  • Achieve 1 PFLOPS Attention on A Single H100 SXM [C++, CUDA]. We built the world’s first 1PF+ performance for the Attention-algorithm.
  • Writing a Compiler Test Suite for a C++ kernel language [OpenCL, C, C++]. For a large vendor we provided an extensive suite of tests to make sure the compiler is according to spec. We made that update, which was a big change from 2.1 because of the addition of C++ kernels.
  • Porting GROMACS, OpenMM, AMBER and more to AMD MI100 GPUs [HIP, SYCL, C++, …]. AMD got awarded various supercomputers in 2022 and 2023 that use their GPUs, and it was therefore crucial to make sure that popular CUDA-optimized software would shine on AMD MI100 GPUs. While we were busy optimizing code, it also ran faster on Nvidia GPUs – this means the comparisons between Nvidia and AMD are fair, and not influenced by one-sided optimizations. If you run one of these packages on your local supercomputer – you're welcome. One example: Efficient molecular dynamics simulations on LUMI
  • Building the Khronos OpenCL SDK [OpenCL, C, C++]. It was always a wish to make OpenCL more than just the language. So we were happy when awarded Github
  • Speeding up pyPasWAS 3 to 5x [C, Python, OpenCL]. We boldly claimed that we could speed up this open-source software to do DNA/RNA/protein sequence alignment and trimming, and so we did. Speedup depends on the data. Read more on the blog
  • Building multiple libraries for AMD [HIP, C++]. Several foundational libraries on ROCm Github were built by us, and we still maintain. This project is still active.
    • rocRAND [HIP, C++]. The world’s fastest random number generator (or second, depending on Nvidia’s response) is built for AMD GPUs, and it’s open source. With random numbers generated at several hundreds of gigabytes per second, the library makes it possible to speed up existing code numerous times. The code is often faster than Nvidia’s cuRAND and is therefore the preferred library to be used on any high-end GPU.
    • rocThrust – AMD’s optimized version of Thrust [HIP, C++]. Highly optimized for CDNA GPUs. Lots of software for CUDA is Thrust based, and now has no lock-in anymore.
    • hipCUB – AMD’s optimized version of CUB [HIP, C++]. Highly optimized for CDNA GPUs. Now porting CUB-based software to AMD is a lot simpler. Both rocThrust and hipCUB share a library rocPRIM which unites many of the GPU-primitives.
  • Porting a set of ADSL-algorithms to an embedded special purpose GPU [OpenCL, C, C++]. Allowing central ADSL-routers in large buildings to handle modern ADSL-protocols.
  • Optimizing and extending the main image processing framework of a large photo hosting platform [CUDA, C++, AWS]. This project is still active. Here we make sure that nobody notices that the original photos are optimized for the current screen on-the-fly, while also providing additional filters and features.
  • Flooding simulation [OpenCL, C++, MPI]. Software that simulates flooding of land, which we ported to multi-GPU on OpenCL and got a 35x speedup over MPI. Read more on the blog
  • Further speeding up CUDA-enabled Quantum Chemistry software [CUDA, C++]: TeraChem, a general-purpose quantum chemistry package designed to run on NVIDIA GPU architectures. Our work resulted in adding an extra 70% performance to the already optimized CUDA-code.
  • Porting Manchester University’s UNIFAC to OpenCL on XeonPhi [OpenCL, C++, MPI]. Even though XeonPhi Knights Corner is not a very performant accelerator, we managed to get a 160x speedup, starting from single threaded code. Most of the speedup is due to clever code-optimizations and less due to low-level optimizations. Where OpenMP could get the single threaded code down to about 8 seconds, we brought it down to 0.062 seconds. Read more on the blog
  • Porting Gromacs from CUDA to OpenCL [CUDA, OpenCL, C, C++]. Until we ported the simulation software at the end of 2014, it had been CUDA-only. The port took several person-months of manually porting all code. You can now download the source, build it and run it on AMD/Intel hardware. All is open source, so you can see our code. Read more on the blog. The backend has since been deprecated in favor of SYCL.

We have helped many more companies become competitive. Some we could vaguely describe, and some we can’t mention. See below the programming languages we worked with, as not all show up in the above list.

Technologies we work with

GPUs and Gartner’s Top 10 Strategic Technology Trends For 2017

What does 2017 bring in technology? At the start of each year, Gartner gives their vision, to provide insight into which technologies to invest in. Looking through them, the most important enabling technologies are the GPU and the Internet-of-Things (IoT) – see the image below. Whereas the last 4 are IoT-based, the first 4 would not have been possible without GPUs.

The middle two are more mature technologies, as they're based on many years of technological progress – and it happens that the GPU has played a big role in getting there. And of course GPUs and IoT are not the only reason these 10 are on this year's list.


Porting CUDA to OpenCL

OpenCL speed-meter in 1970 Plymouth Cuda car

Why port your CUDA-accelerated software to OpenCL? Simply, to make your software also run on AMD CPU/APU/GPU, Intel CPU/GPU, Altera FPGA, Xilinx FPGA, Imagination PowerVR, ARM MALI, Qualcomm Snapdragon and upcoming architectures.

And as OpenCL is an open standard, supported by many vendors, there is much more certainty that it will keep existing in the future than with any proprietary language.

If you look at the history of GPU-programming you'll find many frameworks, such as BrookGPU, Close-to-Metal, Brook+, Rapidmind, CUDA and OpenCL. CUDA was the best choice from 2008 to 2013, as OpenCL had to catch up. Now that OpenCL is gaining serious market traction, the demand for porting legacy CUDA-code to OpenCL is rising – as we clearly notice here.

We are very experienced in porting legacy CUDA-code to all flavours of OpenCL (CPU, GPU, FPGA, embedded). Of course porting from OpenCL to CUDA is also possible, as is updating legacy CUDA-code to the latest standards of CUDA 7.0 and later. We can also improve the architecture; we have made many customers happy by giving them more structured and documented code while working on the port. Want to see some of our work? We ported the molecular dynamics software Gromacs from CUDA to OpenCL.
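Much of such a port is mechanical. As a (simplified) illustration, the OpenCL equivalent of a trivial CUDA kernel, with the CUDA original in the comments:

// CUDA:  __global__ void scale(float* data, float factor) {
// CUDA:      int i = blockIdx.x * blockDim.x + threadIdx.x;
// CUDA:      data[i] = factor * data[i];
// CUDA:  }
__kernel void scale(__global float* data, float factor) {
    int i = get_group_id(0) * get_local_size(0) + get_local_id(0);
    // ...which is simply get_global_id(0) when no global offset is used.
    data[i] = factor * data[i];
}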

Contact us today, to discuss your project in more detail. We have porting services for each budget.

Request a pilot, code-review or more information: https://streamhpc.com/consultancy/request-more-information/

Join us at the Dutch eScience Symposium 2019 in Amsterdam

Soon the Dutch eScience Symposium 2019 will take place in Amsterdam. We thought it might be a good place to meet and to listen to e-science talks. Stream HPC, in the end, is just making scientific software, so we're in the right place. The eScience Center is a government institute that aims to advance eScience in the Netherlands.

Interested? Read on!


OpenCL.org internship/externship

Our internship in one sentence: a rather complex website which should look good on your CV (if you manage to build it).

Want to help build an important website? OpenCL.org's components have been designed and partly built, but a lot of work still needs to be done. We're seeking an intern (or "extern" when not in Amsterdam) who can help us build the site. This internship is not about GPUs!

To complete the tasks, the following is required:

  • Technical expertise:
    • HTML5, CSS
    • PHP
    • Javascript
    • jQuery
    • Node.js
    • Mediawiki
    • XSLT
  • Can-do mentality
  • Able to plan own work
  • Good communication-skills
  • Available for 3 to 6 months

We don't expect you to know all of these tools, so we will guide you in learning new tools and techniques. Send an "email of interest" to info@streamhpc.com, describing what you can do and what your objectives for an internship would be.

We're looking forward to seeing your letter!

OpenCL at SC15 – the booths to go to

This year we're unfortunately not at SuperComputing 2015, for reasons you will hear about later. But we haven't forgotten about the people going and trying to find their share of OpenCL. Below is a list of companies with a booth at SC15, assembled by the guys of IWOCL and completed by us with some more background information.

Khronos

The first place to go is booth #285: meet Khronos to hear where at SC15 you can see how OpenCL has risen over the years. More info here. Say hi from the StreamHPC team!

OpenCL on FPGAs

Altera | Booth: #462. Expected to have many demos on OpenCL. See their program here. They have brought several partners onto the floor, all expected to have OpenCL demos:

  • Reflex | Booth: #3115.
  • BittWare | Booth #3010.
  • Nallatech | Booth #1639.
  • Gidel | Booth #1937.

Xilinx | Booth: #381. Expected to show their latest advancements on OpenCL. See their program here.

Microsoft | Booth: #1319. Microsoft Bing is accelerated using Altera and OpenCL. Ask them for some great technical details.

ICHEC | Booth #2822. The Irish HPC centre works together with Xilinx using OpenCL.

Embedded OpenCL

ARM | Booth: #2015. Big on 64 bit processors with several partners on the floor. Interesting to ask them about the OpenCL-driver for the CPU and their latest MALI performance.

Huawei Enterprise | Booth #173. They recently proudly showed the world their OpenCL-capable camera-phones, using ARM MALI.

HPC OpenCL

Below are the three companies that promise at least 1 TFLOPS DP per co-processor.

Intel | Booth: #1333/1533. Where they used to speak about OpenMP and forget about OpenCL, Altera has brought them back. Maybe they'll share some plans about Xeon+FPGA, or OpenCL support for the new XeonPhi.

AMD | Booth: #727. HBM, HSA, Green500, HPC APU, 32GB GPUs and 2.2 TFLOPS performance – enough to talk about with them. Also lots of OpenCL love.

NVidia | Booth: #1021. Every year they have been quite funny when asked why OpenCL is so badly supported. Please do ask them this question again! The funniest answer wins something from us – to be decided.

Others

You’ll find OpenCL in many other places.

ArrayFire | Booth #2229. Their library has an OpenCL backend.

IBM | Booth: #522. Now that Altera has joined Intel, IBM's OpenPOWER has been left with NVidia for accelerators. OpenCL could revive the initiative.

NEC | Booth: #313. The NEC group has accelerated PostgreSQL with OpenCL.

Send your photos and news!

Help us complete this post with news and photos. We're sorry not to be there this year, so we need your help to make the OpenCL party complete. You can send them via email, twitter or the comments below. Thanks in advance!