Gedit OpenCL Syntax Highlighting

Posted by Vincent Hindriksen on 1 February 2011 with 1 Comment

Update 17-06-2011: updated version of opencl.lang and added opencl_host.lang.

When learning a language it is nice to do it the hard way, so you take the default txt-file editor provided with your OS. No colours, not help, no nothing, pure hard-core learning. But in Linux-desktop Gnome the default editor Gedit is quite powerful without doing too much, has an official Windows-port and has a OSX Darwin-port. It took just a few hours to understand how highlighting in Gedit works and to get it implemented. I got some nice help from the work done at the cuda-highlighter by Hüseyin Temucin (for showing how to extend the c-highlighter the best way) and the VIM OpenCL-highlighter by Terence Ou (for all the reserved words). This is work in progress; I will tell about updates via Twitter.

Get it

Windows-users first need to download Gedit for Windows. OSX-folks can check Darwin-ports. Then the files opencl.lang (.cl-files) and opencl_host.lang (extension of c to highlight OpenCL-keywords) needs to be put in /usr/share/gtksourceview-2.0/language-specs/ (or in ~/.local/share/gtksourceview-2.0/language-specs/ for local usage only), or for Window in C:Program Filesgeditsharegtksourceview-2.0language-specs or for OSX in /Applications/gedit.app/Contents/Resources/share/gtksourceview-2.0/language-specs/. Make sure all Gedit-windows are closed so the configuration will be re-read, and then open a .cl-file with Gedit. If you have opened cl-files as C or Cuda, you have to set the highlighting to OpenCL manually (under view -> highlighting). For host-code you always need to set the highlighting manually to “OpenCL host”. You might want to associate cl-files with Gedit.

Alternatives

VIM: http://www.vim.org/scripts/script.php?script_id=3157

Notepad++: http://sourceforge.net/tracker/?func=detail&aid=2957794&group_id=95717&atid=612384

SciTE: http://forums.nvidia.com/index.php?showtopic=106156

StreamHPC is working on Eclipse-support and I’ve understood also work is done for Netbeans-support. Let me know if there are more alternatives.

OpenCL under Wine

Posted by Vincent Hindriksen on 21 January 2011 with 4 Comments

The Wine 1.3 branch has support for OpenCL 1.0 since 1.3.9. Since Microsoft likes to get a little part of the Linux-dominated HPC-market, support for GPGPU is pretty good under the $799.00 costing Visual Studio – the free Express-version is not supported well. But why not take the produced software back via Wine? Problem is that OpenCL is not in the current Wine binaries for some reason, but that is fixable until we wait for inclusion…

Lazy or not much time? You can try my binaries (Ubuntu 32, NVIDIA), but I cannot guarantee they work for you and it is on your own responsibility: download (reported not working by some). See second part of step 3, what to do with it.

All the steps

I assume you have the OpenCL-SDK installed, but let me know if I need to add more details or clear up some steps.

1 – get the sources

The sources are available here. Be sure you download at least version 1.3.9. Alternatively you download the latest from git. You can get it by going to a directory and execute:

git clone git://source.winehq.org/git/wine.git

A directory “wine” will be created. That was easy, so lets go to bake some binaries.

Continue reading “OpenCL under Wine” →

Installing and using Portable Computing Language (PoCL)

Posted by Vincent Hindriksen on 8 July 2013 with 1 Comment

Update August’13: 0.8 has been released

PoCL stands for Portable Portable Computing Language and the goal is to make a full and open source implementation of OpenCL 1.2 for LLVM.

This is about installing and using PoCL on Ubuntu 64. If you want to put some effort to build it on Windows, you will certainly help the project. See also this TODO for version 0.8, if you want to help out (or want to know its current state). Not all functionality is implemented, but the project progresses using test-driven development – using the samples in the SDKs as a base.

Backends

They are eager for collaboration, so new backends can be added. For what I’ve seen this project is one of the best starts for new OpenCL-drivers. First because of the work that already has been done (implement by example), second because it’s an active open source project (continuous post-development), third because of the MIT-license (permits reuse within proprietary software). Here at StreamHPC we keep a close eye on the project.

On a normal desktop it has only one device and that’s the CPU. It has backends for several other types of CPUs (check ./lib/kernel in the source):

ARM
Cell SPU
Powerpc
Powerpc64
x86_64

Also the TCE libraries can be used as backend. The maturity of each backend differs.

More support is coming. For instance Radeon via the R600 project, but PoCL first needs to support LLVM 3.3 for that.

It is also a good start to build drivers for your own processor (contact us for letting us assist you in building such backend, see Supporting OpenCL on your own hardware for some related info)

Continue reading “Installing and using Portable Computing Language (PoCL)” →

How many threads can run on a GPU?

Posted by Vincent Hindriksen on 24 January 2017 with 5 Comments

Q: Say a GPU has 1000 cores, how many threads can efficiently run on a GPU?

A: at a minimum around 4 billion can be scheduled, 10’s of thousands can run simultaneously.

If you are used to work with CPUs, you might have expected 1000. Or 2000 with hyper-threading. Handling so many more threads than the number of available cores might sound inefficient. There are a few reasons why a GPU has been designed to handle so many threads. Read further…

NOTE: The below description is a (very) simplified model with the purpose to explain the basics. It is far from complete, as it would take a full book-chapter to explain it all. Continue reading “How many threads can run on a GPU?” →

How to speed up Excel in 6 steps

Posted by Vincent Hindriksen on 19 June 2018

After the last post on Excel (“Accelerating an Excel Sheet with OpenCL“), there have been various request and discussions how we do “the miracle”. Short story: we only apply proper engineering tactics. Below I’ll explain how you can also speed up Excel and when you actually have to call us (last step).

A computer can handle 10s of gigabytes per second. Now look how big your Excel-sheet is and how much time it takes. Now you understand that the problem is probably not your computer.

Excel is a special piece of software from a developer’s perspective. An important rule of software engineering is to keep functionality (code) and data separate. Excel mixes these two as no other, which actually goes pretty well in many cases unless the data gets too big or the computations too heavy. In that case you’ve reached Excel’s limits and need to properly solve it.

An Excel-file often does things one-by-one, with a new command in every cell. This prevents any kind of automatic optimizations – besides that, Excel-sheets are very prone to errors.

Below are the steps to go through, of which most you can do yourself!

Continue reading →

DirectCompute’s unpopularity

Posted by Vincent Hindriksen on 28 December 2010

In the world of GPGPU we have currently 4 players: Khronos OpenCL, NVIDIA CUDA, Microsoft DirectCompute and PathScal ENZO. You probably know CUDA and OpenCL already (or start reading more articles from this blog). ENZO is a 64bit-compiler which serves a small niche-market, and DirectCompute is built on top of CUDA/OpenCL or at least uses the same drivers.

Edit 2011-01-03: I was contacted by Pathscale about my conclusions about ENZO. The reason why not much is out there is that they’re still in closed alpha. Expect more to hear from them about ENZO somewhere in the coming 3 months.

A while ago there was an article introducing OpenCL by David Kanter who claimed on page 4 that DirectCompute will win from CUDA. I quote:

Judging by history though, OpenCL and DirectCompute will eventually come to dominate the landscape, just as OpenGL and DirectX became the standards for graphics.

I twittered that I totally disagreed with him and in this article I will explain why I think that.

Continue reading “DirectCompute’s unpopularity” →

Verify your OpenCL and CUDA kernels online for race conditions

Posted by Vincent Hindriksen on 13 May 2013 with 1 Comment

GPUVerify is a tool for formal analysis of GPU kernels written in OpenCL and CUDA. The tool can prove that kernels are free from certain types of defect, such as data races and bugs. This is quite useful feedback for any GPU-programmer.

Below you find a online version of the tool (please don’t break it!). Play around and test your kernels. Be aware the number of groups is the global worksize divided by local worksize.

For demo-purposes some values have been pre-filled with a simple kernel – press “Check my OpenCL kernel” to find the results. Did you expect this from this kernel? Can you explain the result?

After the LEAP-conference I’ll extend this article – till then I’m too time-limited. For now I wanted to share the online version with you, especially with the people who will attend the tutorial at LEAP. Be sure to check out the GPUVerify website and paper to learn more about this fantastic tool! Continue reading “Verify your OpenCL and CUDA kernels online for race conditions” →

Khronos Releases OpenCL 2.2 With SPIR-V 1.2

Posted by Vincent Hindriksen on 16 May 2017 with 1 Comment

Today Khronos has released OpenCL 2.2 with SPIR-V 1.2.

The most important changes are:

A static subset of the C++14 standard as a kernel language. The OpenCL C++ kernel language includes classes, templates, lambda expressions, function overloads and many other constructs to increase parallel programming productivity through generic and meta-programming.
Access to the C++ language from OpenCL library functions to provide increased safety and reduced undefined behavior while accessing features such as atomics, iterators, images, samplers, pipes, and device queue built-in types and address spaces.
Pipe storage, which are compile time pipes. It’s a device-side type in OpenCL 2.2 that is useful for FPGA implementations to enable efficient device-scope communication between kernels.
Enhanced optimization of generated SPIR-V code. Applications can provide the value of specialization constants at SPIR-V compilation time, a new query can detect non-trivial constructors and destructors of program scope global objects, and user callbacks can be set at program release time.
KhronosGroup/OpenCL-Headers repository has been flattened. From now on, all version of OpenCL headers will be available not at separate branches, but all in master branch in separate directories named opencl10, opencl11 etc. Old branches are not removed, but they may not be updated in the future.
OpenCL specifications are now open source. OpenCL Working Group decided to publish sources of recent OpenCL specifications on GitHub, including just released OpenCL 2.2 and OpenCL C++ specifications. If you find any mistake, you can create an appropriate merge request fixing that problem.

This is what we said about the release:

“We are very excited and happy to see OpenCL C++ kernel language being a part of the OpenCL standard,” said Vincent Hindriksen, founder and managing director of StreamHPC. “It’s a great achievement, and it shows that OpenCL keeps progressing. After developing conformance tests for OpenCL 2.2 and helping finalizing OpenCL C++ specification, we are looking forward to work on first projects with OpenCL 2.2 and the new kernel language. My team believes that using OpenCL C++ instead of OpenCL C will result in improved software quality, reduced maintenance effort and faster time to market. We expect SPIR-V to heavily impact the compiler ecosystem and bring several new OpenCL kernel languages.”

Continue reading “Khronos Releases OpenCL 2.2 With SPIR-V 1.2” →

NVIDIA’s answer to SandyBridge and Fusion

Posted by Vincent Hindriksen on 10 January 2011 with 1 Comment

Intel has Sandy Bridge, AMD has Fusion, now NVIDIA has a combination of CPU and GPU too: Project Denver. The only difference is that it is not X86-based, but an ARM-architecture. And most-probable the most powerful ARM-GPU of 2011.

For years there were ARM-based Systems-on-a-chip: a CPU and a GPU combined (see list below). On the X86-platform the “integrated GPU” was on the motherboard, and since this year now both AMD/ATI and Intel hit this “new market”.The big advantage is that it’s cheaper to produce, is more powerful per Watt (in total) and has good acceleration-potential. NVIDIA does not have X86-chips and would have been the big loser of 2011; they did everything to reinvent themselves: 3D was reintroduced, CUDA was actively developed and pushed (free libraries and tools, university-programs, many books and trainings, Tesla, etc), a mobile Tegra graphics solution [1] (see image at the right), and all existing products got extra backing from the marketing-department. A great time for researchers who needed to get free products in exchange of naming NVIDIA in their research-reports.

NVIDIA chose for ARM; interesting for who is watching the CUDA-vs-OpenCL battle, since CUDA was for GPUs of NVIDIA on X86 and ARM was solely for OpenCL. Period. In the contrary to their other ARM-based chips, this new chip probably won’t be in smartphones (yet); it targets systems that need more GPU-power like CUDA and games.

In a few days the article about Windows-on-ARM is to be released, which completes this article.

Continue reading “NVIDIA’s answer to SandyBridge and Fusion” →

PathScale ENZO

Posted by Vincent Hindriksen on 15 April 2011

My todo-list gets too large, because there seem so many things going on in GPGPU-world. Therefore the following article is not really complete, but I hope it gives you an idea of the product.

ENZO was presented as the alternative to CUDA and OpenCL. In that light I compared them to Intel Array Building Blocks a few weeks ago, not that their technique is comparable in a technical way. PathScale’s CTO mailed me and tried to explain what ENZO really is. This article consists mainly of what he (Mr. C. Bergström) told me. Any questions you have, I will make sure he receives them.

ENZO

ENZO is a complete GPGPU solution and ecosystem of tools that provide full support for NVIDIA Tesla (kernel driver, runtime, assembler, code generation, front-end programming model and various other things to make a developer’s life easier).

Right now it supports the HMPP C/Fortran programming model. HMPP is an open standard jointly developed by CAPS and PathScale. I’ve mentioned HMPP before, as it can translate Fortran and C-code to OpenCL, CUDA and other languages. ENZO’s implementation differs from HMPP by using native front-ends and does hardware optimised code generation.

You can learn ENZO in 5 minutes if you’ve done any OpenMP-like programming in the past. For example the Fortran-code (sorry for the missing indenting):

!$hmpp simple codelet, target=TESLA1

subroutine add(n, a, b, c)
implicit none
integer, intent(in) :: n
real, intent(in) :: a(n), b(n)
real, intent(out) :: c(n)
integer :: i

do i=1, n
if (a(i) > 5) then
c(i) = 1
else
c(i) = 2
endif
enddo

end subroutine add

subroutine test
integer, parameter :: n=10
real :: a(n), b(n), c(n)
integer :: i

do i=1, n
a(i) = i
b(i) = i
enddo

!$hmpp simple callsite
call add(n, a, b, c)
end subroutine test

This is somewhat different we know from OpenCL, mostly because we don’t need a specific kernel. This is because with just a few hints, the compiler does a lot for you. Like in OpenMP you tell the compiler with directives/pragmas which parts you want to be parallelised. More explanation can be found in the user manual [PDF]. You can try it out yourself for free if you have a Tesla-card; future versions of ENZO will support more architectures.

Installing both NVidia GTX and AMD Radeon on Linux for OpenCL

Posted by Vincent Hindriksen on 12 October 2011 with 6 Comments

August 2012: article has been completely rewritten and updated. For driver-specific issues, please refer to this article.

Want to have both your GTX and Radeon working as OpenCL-devices under Linux? The bad news is that attempts to get Radeon as a compute device and the GTX as primary all failed. The good news is that the other way around works pretty easy (with some luck). You need to install both drivers and watch out that libglx.so isn’t overwritten by NVidia’s driver as we won’t use that GPU for graphics – this is also the reason why it is impossible to use the second GPU for OpenGL.

Continue reading “Installing both NVidia GTX and AMD Radeon on Linux for OpenCL” →

Guest-blog: Accelerating sequential machine vision algorithms with OpenMP and OpenCL

Posted by Vincent Hindriksen on 11 November 2013 with 1 Comment

Guest-blogger Jaap van de Loosdrecht wants to share his thesis with you. He leads the Centre of Expertise in Computer Vision department at NHL University of applied sciences and is the owner of his own company, and still managed to study and write a MSc-thesis. The thesis is interesting because it extensively compares OpenCL with OpenMP, especially chapters 7 an 8.

For those who are interested, my thesis “Acceleration sequential machine vision algorithms using commodity parallel hardware” is available at www.vdlmv.nl/thesis.

Keywords: Computer Vision, Image processing, Parallel programming, Multi-core CPU, GPU, C++, OpenMP, OpenCL.

Many other related research projects have considered using one domain specific algorithm to compare the best sequential implementation with the best parallel implementation on a specific hardware platform. This work was distinctive because it investigated how to speed up a whole library by parallelizing the algorithms in an economical way and execute them on multiple platforms.This work has:

Examined, compared and evaluated 22 programming languages and environments for parallel computing on multi-core CPUs and GPUs.

Chosen to use OpenMP as the standard for multi-core CPU programming and OpenCL for GPU programming.

Re-implemented a number of standard and well-known algorithms in Computer Vision using both standards.

Tested the performance of the implemented parallel algorithms and compared the performance to the sequential implementations of the commercially available software package VisionLab.

Evaluated the test results with a view to assessing:

Appropriateness of multi-core CPU and GPU architectures in Computer Vision.

Benefits and costs of parallel approaches to implementation of Computer Vision algorithms.

Using OpenMP it was demonstrated that many algorithms of a library could be parallelized in an economical way and that adequate speedups were achieved on two multi-core CPU platforms. With a considerable amount of extra effort, OpenCL was used to achieve much higher speedups for specific algorithms on dedicated GPUs.

At the end of the project, the choice of standards was re-evaluated including newly emerged ones. Recommendations are given for using standards in the future, and for future research and development.

Algorithmic improvements are suggested for Convolution and Connect Component Labelling.

Your feedback and/or questions are welcome.

If you put comments here, I’ll make sure Jaap van de Loosdrecht will get to know and answer your questions on the subjects discussed in his thesis.

DOI: Digital attachments for Scientific Papers

Posted by Vincent Hindriksen on 13 March 2018

Ever saw a claim on a paper you disagreed with or got triggered by, and then wanted to reproduce the experiment? Good luck finding the code and the data used in the experiments.

When we want to redo experiments of papers, it starts with finding the code and data used. A good start is Github or the homepage of the scientist. Also Gitlab. Bitbucket, SourceForge or the personal homepage of one of the researchers could be a place to look. Emailing the authors is often only an option, if the university homepage mentions such option – we’re not surprised to get no reaction at all. If all that doesn’t work, then implementing the pseudo-code and creating own data might be the only option – not if that will support the claims.

So what if scientific papers had an easy way to connect to digital objects like code and data?

Here the DOI comes in.

Continue reading “DOI: Digital attachments for Scientific Papers” →

StreamComputing is 7 years!

Posted by Vincent Hindriksen on 10 April 2017

As of 1 April we are 7 years old. Because of all the jokes on that day, this post is a bit later.

Let me take you through our journey how we grew up from a 1-person company to what we’re now. With pride I can say that (with ups and downs) StreamComputing (now rebranded to StreamHPC) has become a brand that equals to (extremely) fast software, HPC, GPUs and OpenCL.

7 years of changes

Different services

After 7 years it’s also time for changes. Initially we solely worked on OpenCL related services, mostly GPUs. And this is what we’re currently doing:

HPC GPU computing: OpenCL, CUDA, ROCm.
Embedded GPU computing: OpenCL, CUDA, RenderScript, Metal.
Networked FPGA programming: OpenCL.
GPU-drivers testing and optimisation.
Software architecture optimisations.

While you see OpenCL a lot, our expertise in vendor-specific CUDA (NVidia), ROCm (AMD), RenderScript (Google) and Metal (Apple) cannot be ignored. Hence the “Performance Engineers” and not “GPU consultants” or “OpenCL programmers”.

From Fixers to Builders and getting new competition

Another change is that we have been going from fixing code afterwards to building software.

This has been a slow process and had to do with the confidence in performance engineering as an expert profession instead of a trick. We’re seeing new companies coming into the market and providing GPU-computing next to their usual services. This is a sign of the market growing up.

We’re confident in growing further in our market, as we have the expertise to design fast software while the newcomers have gained expertise to write code that runs on the GPU with only little speedup.

Community: OpenCL:PRO to OpenCL.org

There have been more times when we wanted to support the community more. The first try was OpenCL:PRO and did not live long, as it was actually unclear to us what “the community” wanted.

In the end it was not that hard. Everybody who starts with OpenCL has the same problems:

Lack of convenience code, resulting in many, many wrappers and libraries that are incompatible.
Lack of practice projects.
Lack of overview on what’s available.

With OpenCL.org we aim to solve these problems together with the community. All is shared on Github and anybody can join to complete the information we’ve shared. While our homepage had around 40 pages on these subjects, it was only our personal view on the subjects or had outdated info.

So we’re going to donate most of the OpenCL-related technical pages we’ve written over the years to the community.

There is much more to share – watch our blog, the OpenCLorg twitter and newsletter!

Different Logo

For who remembered: in 2010 the logo looked quite different. We still use the blocks in the background (like on our Twitter account), but since 2014 the colours and font are quite different. This change has been going along with the company growing up. The old logo is careful, while the new one is bold – now we’re more confident about our expertise and value.

Over the past 3 years the new logo has stayed the same and has fully become our identity.

Same kind of customers

It has been quite a journey! We could not have done it without all the customers we served over those 7 years.

Thank you!

GPUs and Gartner’s Top 10 Strategic Technology Trends For 2017

Posted by Vincent Hindriksen on 27 March 2017

What brings 2017 in technology? Gartner gives their vision with the start of each year to give insight in which technologies to invest in. When looking through them, the most important enabling technologies are the GPU and Internet-of-Things (IoT) – see the image below. Whereas the last 4 are IoT based, the first 4 would not have been possible without GPUs.

The middle two are more mature technologies, as they’re based on technology progress of many years – it happens to be that the GPU has played a big role to get here. And ofcourse not only GPUs and IoT are the reason these 10 are on this year’s list.

Continue reading “GPUs and Gartner’s Top 10 Strategic Technology Trends For 2017” →

Intel CPUs, GPUs & Xeon Phi

[infobox type=”information”]

Need a XeonPhi or Intel OpenCL programmer? Hire us!

[/infobox]

Intel has support for all their recent CPUs which have SSE 4.x and AVX. Since SandyBridge the CPUs tend to have good performance. On IndyBridge and later there is also support for the embedded GPU (Windows-only). XeonPhi has support for OpenCL, even though they promote OpenMP most.

SDK

Intel does not provide a standard SDK kit which contains both hardware and software, as their hardware is broadly available.

The driver can be downloaded from the Intel OpenCL page – select your OS at the upper-right and click ‘Download’.

The samples are included with the driver, if you use Windows. They can be downloaded separately here. If you have Linux, you can download the samples which have been ported to GCC from our blog – here you can also read on how to install the SDK.

Tools

There are various developer tools available. You can find them here:

Offline compiler (stand-alone (Windows+Linux) and VisualStudio-plugin)
OpenCL – Debugger (VisualStudio only)
Integration with Graphics Performance Analyzers (Windows-dowload)
VTune Amplifier XE for code-optimisation (more info here, starting at $899 for both Windows and Linux)

Supported hardware

In short: all Ivy Bridge and Sandy Bridge processors.

Currently the HD4000 is the only embedded GPU that can do OpenCL, and only is supported via Windows drivers.

Xeon Phi

Intel’s official page has more info on the processor-card, and here you’ll find the most recent (public) info.

CPUs and GPUs

With Xeons of 12 to 16 cores and AVX2 (512 bits wide vectors), OpenCL works very well on CPUs.

For GPU bug-reports go to this forum.

Learning material

See this blog post for information on where to find all drivers and samples.

To optimise OpenCL for Intel-processors, you can go through their very nice Optimization Guide. There is also a nice overview of tips&tricks in this article. The Intel OpenCL forums are also a very good source of information.

How to get full CMake support for AMD HIP SDK on Windows – including patches

Posted by Máté Ferenc Nagy-Egri on 1 August 2023

Written by Máté Ferenc Nagy-Egri and Gergely Mészáros

Disclaimer: if you’ve stumbled across this page in search of fixing up the ROCm SDK’s CMake HIP language support on Windows and care only about the fix, please skip to the end of this post to download the patches. If you wish to learn some things about ROCm and CMake, join us for a ride.

Finally, ROCm on Windows

The recent release of the AMD’s ROCm SDK on Windows brings a long awaited rejuvenation of developer tooling for offload APIs. Undoubtedly it’s most anticipated feature is a HIP-capable compiler. The runtime component amdhip64.dll has been shipping with AMD Software: Adrenalin Edition for multiple years now, and with some trickery one could consume the HIP host-side API by taking the API headers from GitHub (or a Linux ROCm install) and creating an export lib from the driver DLL. Feeding device code compiled offline and given to HIP’s Module API was attainable, yet cumbersome. Anticipation is driven by the single-source compilation model of HIP borrowed from CUDA. That is finally available* now!

[*]: That is, if you are using Visual Studio and MSBuild, or legacy HIP compilation atop CMake CXX language support.

Continue reading →

StreamComputing exists 5 years!

Posted by Vincent Hindriksen on 25 May 2015 with 3 Comments

In January 2010 I created the first steps of StreamComputing (redacted: rebranded to StreamHPC in 2017), by registering the website and writing a hello-world article. About 4 months of preparations and paperwork later the freelance-company was registered. Then 5 years later it got turned into a small company with still the strong focus on OpenCL, but with more employees and more customers.

I would like to thank the following people:

My parents and grand-mother for (financially) supporting me, even though they did not always understand why I was taking all those risks.
My friends, for understanding I needed to work in the weekends and evenings.
My good friend Laura for supporting me during the hard times of 2011 and 2012.
My girlfriend Elena for always being there for me.
My colleagues and OpenCL-experts Anca, Teemu and Oscar, who have done the real work the past year.
My customers for believing in OpenCL and trusting StreamComputing.

Without them, the company would never even existed. Thank you! Continue reading “StreamComputing exists 5 years!” →

SC15 news from Monday

Posted by Vincent Hindriksen on 16 November 2015 with 4 Comments

Warning: below is raw material, and needs some editing.

Today there was quite some news around OpenCL, I’m afraid I can’t wait till later to have all news covered. Some news is unexpected, some is great. Let’s start with the great news, as the unexpected news needs some discussion.

Khronos released OpenCL 2.1 final specs

As of today you can download the header files and specs from https://www.khronos.org/opencl/. The biggest changes are:

C++ kernels (still separate source files, which is to be tackled by SYCL)
Subgroups are now a core functionality. This enables finer grain control of hardware threading.
New function clCloneKernel enables copying of kernel objects and state for safe implementation of copy constructors in wrapper classes. Hear all Java and .NET folks cheer?
Low-latency device timer queries for alignment of profiling data between device and host code.

OpenCL 2.1 will be supported by AMD. Intel was very loud with support when the provisional specs got released, but gave no comments today. Other vendors did not release an official statement.

Khronos released SPIR-V 1.0 final specs

SPIR-V 1.0 can represent the full capabilities of OpenCL 2.1 kernels.

This is very important! OpenCL is not the only language anymore that is seen as input for GPU-compilers. Neither is OpenCL hostcode the only API that can handle the compute shaders, as also Vulkan can do this. Lots of details still have to be seen, as not all SPIRV compilers will have full support for all OpenCL-related commands.

With the specs the following tools have been released:

A bi-directional translator between LLVM to SPIR-V to enable flexible use of both intermediate languages in tool chains.
An OpenCL C to LLVM compiler that generates SPIR-V through the above translator, as Clang can compile OpenCL 1.2/2.0 C kernels.
A SPIR-V assembler and disassembler.

SPIRV will make many frontends possible, giving co-processor powers to every programming language that exists. I will blog more about SPIRV possibilities the coming year.

Intel claims OpenMP is up to 10x faster than OpenCL

The below image appeared on Twitter, claiming that OpenMP was much faster than OpenCL. Some discussion later, we could conclude they compared apples and oranges. We’re happy to peer-review the results, putting the claims in a full perspective where MKL and operation mode is mentioned. Unfortunately they did not react, as <sarcasm>we will be very happy to admit that for the first time in history a directive language is faster than an explicit language – finally we have magic!</sarcasm>

https://twitter.com/StreamHPC/status/666178711549583364

CT3iJBlVEAA2WCY — Left half is FFT and GEMM based, probably using Intel’s KML. All algorithms seems to be run in a different mode (native mode) when using OpenMP, for which intel did not provide OpenCL driver support for.

We get back later this week on Intel and their upcoming Xeon+FPGA chip, if OpenCL is the best language for that job. It ofcourse is possible that they try to run OpenMP on the FPGA, but then this would be big surprise. Truth is that Intel doesn’t like this powerful open standard intruding the HPC market, where they have a monopoly.

AMD claims OpenCL is hardly used in HPC

Well, this is one of those claims that they did not really think through. OpenCL is used in HPC quite a lot, but mostly on NVidia hardware. Why not just CUDA there? Well, there is demand for OpenCL for several reasons:

Avoid vendor lock-in.
Making code work on more hardware.
General interest in co-processors, not specific one brand.
Initial code is being developed on different hardware.
…

Thing is that NVidia did a superb job in getting their processors in supercomputers and clouds. So OpenCL is mostly run on NVidia hardware and a therefore the biggest reason why that company is so successful in slowing the advancement of the standard by rolling out upgrades 4 years later. Even though I tried to get the story out, NVidia is not eager to tell the beautiful love story between OpenCL and the NVidia co-processor, as the latter has CUDA as its wife.

Also at HPC sites with Intel XeonPhi gets OpenCL love. Same here: Intel prefers to tell about their OpenMP instead of OpenCL.

AMD has few HPC sites and indeed there is where OpenCL is used.

No, we’re not happy that AMD tells such things, only to promote its own new languages.

CUDA goes AMD and open source

AMD now supports CUDA! The details: they have made a tool that can compile CUDA to “HiP” – HiP is a new language without much details at the moment. Yes, I have the same questions as you are asking now.

Also Google joined in and showed progress on their open source implementation of CUDA. Phoronix is currently the best source for this initative and today they shared a story with a link to slides from Google on the project. the results are great up: “it is to 51% faster on internal end-to-end benchmarks, on par with open-source benchmarks, compile time is 8% faster on average and 2.4x faster for pathological compilations compared to NVIDIA’s official CUDA compiler (NVCC)”.

For compiling CUDA in LLVM you need three parts:

a pre-processor that works around the non-standard <<<…>>> notation and splits off the kernels.
a source-to-source compiler for the kernels.
an bridge between the CUDA API and another API, like OpenCL.

Google has done most of this and now focuses mostly on performance. The OpenCL community can use this to use this project to make a complete CUDA-to-SPIRV compiler and use the rest to improve POCL.

Khronos gets a more open homepage

Starting today you can help keeping the Khronos webpage more up-to-date. Just put a pull request at https://github.com/KhronosGroup/Khronosdotorg and wait until it gets accepted. This should help the pages be more up-to-date, as you can now improve the webpages in more ways.

More news?

AMD released HCC, a C++ language with OpenMP built-in that doesn’t compile to SPIRV.

There have been tutorials and talks on OpenCL, which I should have shared with you earlier.

Tomorrow another post with more news. If I forgot something on Sunday or Monday, I’ll add it here.

PDFs of Monday 19 September

Posted by Vincent Hindriksen on 19 September 2011

Already the fourth PDF-Monday. It takes quite some time, so I might keep it to 10 in the future – but till then enjoy! Not sure which to read? Pick the first one (for the rest there is not order).

Edit: and the last one, follow me on twitter to see the PDFs I’m reading. Reason is that hardly anyone clicked on the links to the PDFs.

I would like if you let others know in the comments which PDF you liked a lot.

Adding Physics to Animated Characters with Oriented Particles (Matthias Müller and Nuttapong Chentanez). Discusses how to accelerate movements of pieces of cloth attached to the bodies. Not time to read? There are nice pictures.

John F. Peddy’s analysis on the GPU market.

Hardware/Software Co-Design. Simple Solution to the Matrix Multiplication Problem using CUDA.

CUDA Based Algorithms for Simulating Cardiac Excitation Waves in a Rabbit Ventricle. Bioinformatics.

Real-time implementation of Bayesian models for multimodal perception using CUDA.

GPU performance prediction using parametrized models (Master-thesis by Andreas Resios)

A Parallel Ray Tracing Architecture Suitable for Application-Specific Hardware and GPGPU Implementations (Alexandre S. Nery, Nadia Nedjah, Felipe M.G. Franca, Lech Jozwiak)

Rapid Geocoding of Satellite SAR Images with Refined RPC Model. An ESA-presentation by Lu Zhang, Timo Balz and Mingsheng Liao.

A Parallel Algorithm for Flight Route Planning with CUDA (Master-thesis by Seçkîn Sanci). About the travelling salesman problem and much more.

Color-based High-Speed Recognition of Prints on Extruded Materials. Product-presentation on how to OCR printed text on cables.

Supplementary File of Sequence Homology Search using Fine-Grained Cycle Sharing of Idle GPUs (Fumihiko Ino, Yuma Munekawa, and Kenichi Hagihara). They sped up the BOINC-system (Folding@Home). Bit vague what they want to tell, but maybe you find it interesting.

Parallel Position Weight Matrices Algorithms (Mathieu Giraud, Jean-Stéphane Varré). Bioinformatics, DNA.

GPU-based High Performance Wave Propagation Simulation of Ischemia in Anatomically Detailed Ventricle (Lei Zhang, Changqing Gai, Kuanquan Wang, Weigang Lu, Wangmeng Zuo). Computation in medicine. Ischemia is a restriction in blood supply, generally due to factors in the blood vessels, with resultant damage or dysfunction of tissue

Per-Face Texture Mapping for Realtime Rendering. A Siggraph2011 presentation by Disney and NVidia.

Introduction to Parallel Computing. The CUDA 101 by Victor Eijkhout of University of Texas.

Optimization on the Power Efficiency of GPU and Multicore Processing Element for SIMD Computing. Presentation on what you find out when putting the volt-meter directly on the GPU.

NUDA: Programming Graphics Processors with Extensible Languages. Presentation on NUDA to write less code for GPGPU.

Qt FRAMEWORK: An introduction to a cross platform application and user interface framework. Presentation on the Qt-platform – which has great #OpenCL-support.

Data Assimilation on future computer architectures. The problems projected for 2020.

Current Status of Standards for Augmented Reality (Christine Perey1, Timo Engelke and Carl Reed). not much to do with OpenCL, but tells an interesting purpose for it.

Parallel Computations of Vortex Core Structures in Superconductors (Master-thesis by Niclas E. Wennerdal).

Program the SAME Here and Over There: Data Parallel Programming Models and Intel Many Integrated Core Architecture. Presentation on how to program the Intel MIC.

Large-Scale Chemical Informatics on GPUs (Imran S. Haque, Vijay S. Pande). Book-chapter on the design and optimization of GPU implementations of two popular chemical similarity techniques: Gaussian shape overlay (GSO) and LINGO.

WebGL, WebCL and Beyond! A presentation by Neil Trevett of NVidia/Khronos.

Biomanycores, open-source parallel code for many-core bioinformatics (Mathieu Giraud, Stéphane Janot, Jean-Frédéric Berthelot, Charles Delte, Laetitia Jourdan , Dominique Lavenier , Hélène Touzet, Jean-Stéphane Varré). A short description on the project http://www.biomanycores.org.

Random Numbers in Parallel Computing: Generation and Reproducibility (Part 1)

Posted by Michael Noisternig on 18 March 2016 with 2 Comments

random_300 Random numbers are important elements in stochastic simulations, but they also show up in machine learning and applications of Monte Carlo methods such as within computational finances, fluid dynamics and molecular dynamics. These are classical fields in high-performance computing, which StreamHPC has experience in.

A common problem when porting traditional software in these fields to parallel computing environments is the generation and reproducibility of random numbers. Questions that arise are:

Performance: How can we efficiently obtain random numbers when they are classically generated in a serial fashion?
Quality: How can we make sure that random numbers generated in a parallel environment still fulfil statistical randomness requirements?
Verification: How can we be sure that the parallel implementation is correct?

We consider verification from the viewpoint of producing identical results among different software implementations. This is often an important matter for our customers, and we have given them guidance on how to address this issue when random numbers are involved.

In this first part of our two-part blog series, we will briefly address some common pitfalls in the generation of random numbers in parallel environments and suggest suitable random-number generation libraries for OpenCL and CUDA. In the second part – on the blog soon – we will discuss solutions for reproducibility in the presence of random numbers.

Generation

Random numbers in computer software are typically obtained via a deterministic pseudo-random number generator (PRNG) algorithm. The output of such an algorithm is not truly random but pseudo-random (i.e., it appears statistically random), though we will simply say “random” for simplicity. We do not consider truly random numbers, which may be derived from physical phenomena such as radioactive decay, because we want the output of a random number generator to be reproducible.

PRNGs traditionally offered to application developers fail within the parallel setting. One reason is that these algorithms usually only support the sequential generation of random numbers based on some initial (seed) value (e.g., consider the standard C rand() function), so work items on a parallel device would need to block for getting exclusive access to the generator, which clearly impacts efficiency.

Some applications may require only a moderate amount of random numbers. In this case, we found it feasible to precompute the required set of random numbers and hold them in global memory. We call this the table-based approach. Other applications in turn may need to efficiently create a huge amount of random numbers. In this case, it may be necessary to equip each work item with its own PRNG seed. One potential problem with this approach is the use of weak PRNGs such as linear congruential generators (LCGs), which remain popular due to their speed and simplicity. In parallel settings, correlations between output sequences are aggravated and the quality of the application output may be severely affected, so LCGs should not be used at all. Another problem is the use of a small seed or a small PRNG’s internal state space. In this case, we may expect that the probability of two work items creating the same random sequence is quite high. Indeed, if we would randomly seed via srand(), the chance is already 50% for two out of approximately 77,000 work items creating entirely the same random number sequence! So we may either need a PRNG with a larger seed space and internal state, or one with a larger state and some mechanism to subslice the PRNG’s output sequence into non-overlapping “substreams”, with one substream per work item. The Mersenne Twister is highly acclaimed but requires a memory state of approximately 2.5 KB per work item in a parallel setting, and substreams are difficult to implement. While good PRNGs with a small internal state and flexible substream support exist (e.g., MRG32k3a), there are also “index-based” PRNGs, which are often more elaborate to compute but do not maintain any state. Such state-less PRNGs take an arbitrary index and a “key” as input and return a random number corresponding to the index in its random output sequence (which depends on the key chosen). Index-based PRNGs are very useful in parallel computing environments, and we will show how we use them for reproducibility in the second part of this blog.

The choice of an appropriate PRNG may not be easy and ultimately depends on the application scenario. Luckily, there is choice! CUDA offers a set of PRNGs via its cuRAND library, and OpenCL applications can benefit from the clRNG library that AMD has released last year. Both cuRAND and clRNG offer a state-based interface with substream support. For index-based algorithms, the Random123 library provides high-quality PRNG implementations for both OpenCL and CUDA.

So far, we have discussed how we can safely generate random numbers in the GPU and FPGA context, but we cannot control the order in which parallel, concurrent work items create random numbers. This makes it difficult to verify the parallel implementation since its output may be different from that of the serial, original code. So the question is, in the presence of random numbers, how can we easily verify that our parallel code implements not only a faithful but a correct port of the serial version? This is addressed in part two – continue reading.