IWOCL 2017 Toronto call for talks and posters is open

The fifth International Workshop on OpenCL (IWOCL) will be held on 16-18 May 2017 in Toronto, Canada. The event kicks off with a full-day Advanced Hands-On OpenCL tutorial, followed by two days of conference: keynotes, academic papers, technical presentations, tutorials, poster sessions and table-top demonstrations.

The IWOCL 2017 call for submissions is now open – submit your abstract here. The deadline is the beginning of February, so better submit in the coming month!

The call for IWOCL 2017 annual sponsors is also open. For that, contact the IWOCL organisation via this webform.

Every year there have been unique conversations with real influence on the OpenCL standard, and we have heard real-life development experience during various talks. If you missed real technical talks at certain other GPU conferences, then IWOCL is where you should go.

StreamComputing is 2 years old! A personal story.

More than two years ago, on 13 January 2010, I wrote my first blog-post. Four months later StreamComputing (redacted: rebranded to StreamHPC in 2017) was both official and unknown. I want to share with you my personal story of how I came to start up this company.

The push-factor

I wanted to create a company which was about innovative projects – something I had hardly encountered until then. In the years before, I programmed parts of A-to-B-flows, as I call them. That is software that is quite simple at its base, but tediously discussed as very, very complex.

“Complex” software

The complexity is not the software, as you can see. It is undocumented APIs, forgotten knowledge, knowledge in the heads of unknown people, bossy and demanding people who ask in a friendly way for last-minute architecture changes, deadlines around promotion-rounds, new deadlines due to board-decisions, people being afraid of getting replaced once the software is finished, jealousy if another team makes version 2 of the software, etc. The rule of office-software is therefore understandable:

Software is either unfinished,
or turned into a platform for unintended functionality.

The fun in office-software is there for the analyst, architect or manager – the developer just puts in his earphones and makes all the requested changes (hooray for services like Spotify). But as I did not want to become a manager and wished to keep improving my development skills, I had to conclude I was on the wrong track.

Continue reading “StreamComputing is 2 years old! A personal story.”

A typical week

Primary and secondary tasks

The main focus is programming and solving problems. That means that everything that obstructs this focus needs to be gotten out of the way. This is simpler on paper than in reality, and therefore there are multiple “faiths” among companies on how to do this.

We start by clearly distinguishing primary and secondary tasks, where the difference is that more time needs to be spent on the primary tasks in the long term. The last part of that sentence is very important.

What we do every day and week:

  • Planning
    • Write issues
    • Make issue estimations
    • Prioritize issues
    • Bundle issues in epics
    • Pick issues for personal weekly milestones
  • Problem-solving
  • Coding and math
  • Learning
    • Reading books
    • Reading papers
    • Watching videos

Why so much emphasis on planning?

The planning-part takes a good amount of time, but keeps us from spending too much time on dead ends. And spending time on dead ends is not a primary task at all. Planning also helps with designing better strategies – there is limited time for solving problems and coding software, so doing full-scope research is not going to work. As there is no way to efficiently build complex code without any time-estimations of the different approaches, planning-skills provide the necessary foundations for becoming a senior coder.

We start training these skills as early as possible, so juniors are also asked to do all planning-tasks. Initially this takes a good part of the valuable coding-time, but that share quickly goes down and the first advantages become visible.

Style of project handling

Tools

We mostly use Gitlab and Mattermost to share code and have discussions. This makes it possible to keep good track of each project – searching for what somebody said or coded two years ago is quite easy. Using modern tools has changed the way we work a lot, thus we have questioned and optimized everything that was presented as “good practice”.

We continuously look into new tools that can help us improve. Also here the main focus is to reduce the time on secondary tasks, so we can spend more time thinking on problem-solving.

Pull-style project management

The tasks are written down by the team, using the project-doc as input. All these tasks are put into the task-list of the project and estimated. Then each team member picks the tasks that are a good fit. There are always tasks that need to be pushed instead of pulled, but luckily that’s a relatively small part of all work.

All code (MR) is checked by one or two colleagues, chosen by the one who wrote the code. More important are the discussions in advance, as the group can give more insight than any individual and one can get into the task well-prepared. The goal is not to get the job finished, but to not have written the code in which a bug is found later.

All types of code can contain comments and Doxygen can create documentation automatically, so there is no need to copy functions into a Word-document. Log-style documentation was introduced, as git history and Doxygen don’t answer why a certain decision has been made. By writing down a logbook, a new member of the team can just read these remarks and fully understand why the architecture is how it is and what the limits are. We’ll discuss this in more detail later.

These types of solutions describe how we work and how we differ from a corporate environment: no-nonsense and effective.

The week

If you worked here, what would your week look like in the first year? We say the first year specifically, as for more complex projects different approaches could be chosen.

Monday weekly planning

Together with your team you pick up the issues for the week. The issues should have estimations, or these will be done during that meeting. When your week is filled, you know what to do.

Monday weekly meeting

Every Monday we have a weekly meeting to share with everybody how the other projects are doing.

Mon-Fri: Daily standup

Retrospective of the previous day, and tuning of the day ahead.

Practice:

  • Tools
  • C/C++
  • GPGPU
  • Scrum

Friday closing

Weekly retrospective, cleaning up, writing notes on issues, etc.

Weekly customer meetings

Here we discuss the progress and anything blocking. The customer shares their progress, and together problems can be solved.

Many projects have a shared (high-level) issue-list, so the progress is continuously synced with the customer and communication is easy.

Let us do your peer-review

There are many research papers that claim enormous speed-ups using an accelerator. From our experience a large part of the speed-up comes from code-modernisations (parallelisation & optimisation), which makes the claim look false. That’s why we offer peer-reviews at half our rate for CUDA and OpenCL software. The final costs depend on the size and complexity of the code.

We will profile your CPU and accelerator code on our machines and review the code. The results show both the effect of the code-modernisations and the effect of using the accelerator (GPU, XeonPhi, FPGA). With this we hope to stimulate that the effect of code-modernisation gets more research attention than using “miracle hardware”.

Don’t misunderstand: GPUs can still get an average of 8x speedup (or 700% speed improvement) over optimised code, which is still huge! But it’s simply not the 30-100x speed-up claimed in the slide at the right.
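The arithmetic behind "8x speedup" and "700% speed improvement" can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the split of a claimed speed-up into two parts assumes that the effect of code-modernisation and the effect of the accelerator simply multiply.

```python
def speedup_to_percent_improvement(speedup: float) -> float:
    """Convert a speed-up factor (e.g. 8 for '8x') into a percentage improvement."""
    if speedup <= 0:
        raise ValueError("speed-up factor must be positive")
    # An Nx speed-up is N times the baseline speed,
    # which is (N - 1) * 100 percent on top of that baseline.
    return (speedup - 1.0) * 100.0


def accelerator_share(total_speedup: float, modernisation_speedup: float) -> float:
    """Part of a claimed total speed-up that is left for the accelerator.

    Assumption (not from the article): total = modernisation * accelerator.
    """
    if modernisation_speedup <= 0:
        raise ValueError("modernisation speed-up must be positive")
    return total_speedup / modernisation_speedup


if __name__ == "__main__":
    print(speedup_to_percent_improvement(8))   # the 8x from the text
    # Hypothetical numbers: a claimed 100x of which 12.5x is due to optimised code.
    print(accelerator_share(100, 12.5))
```

With the hypothetical numbers in the last line, a claimed 100x shrinks to the 8x that the accelerator itself contributes once the baseline is optimised as well.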

 

OpenCL potentials: Watermarked media for content-protection

HTML5 is the future, now that Flash and Silverlight are abandoning the market to clear the way for HTML5-video. There is one big problem: it is hard to protect the content – before you know it, the movie is on the free market. DRM is only a temporary solution and many times ends in frustration for users who just want to see the movie wherever they want.

If you look at e-books, you see a much better way to make sure PDFs don’t get all over the web: personalizing. With images and videos this could be done too. The example here at the right has a very obvious, clearly visible watermark (source), but there are many methods which are not easy to see – and thus easier to miss for people who want to clean the file. It therefore has a clear advantage over DRM, where it is obvious what has to be removed. Watermarks give the buyers freedom of use. The only disadvantage is that the ownership of a personalised video cannot be transferred.

Continue reading “OpenCL potentials: Watermarked media for content-protection”

Applied GPGPU-days Amsterdam 2013

December 2013: Videos are not ready yet, but a link will be put here.

Amsterdam, 20 June – Applied GPGPU-days in Amsterdam. Keep your agenda free for this event.

What can you do with GPUs to speed up computations? This year we can see various examples where OpenCL and CUDA have been used. We hope to give you an answer if you can use GPUs for your software, research or algorithm.

After the success of last year (fully booked with 66 attendees), we now have reserved a larger location with room for 100 people. The difference with last year is that we focus more on applications and less on technical aspects.

The program has been made public recently:

Title of talk | Company/Institute | Presenter
Introduction to GPGPU and GPU-architectures | StreamHPC | Vincent Hindriksen
Blender Cycles & Tiles: Enhancing user experience | AtMind bv | Monique Dewanchand & Jeroen Bakker
XeonPhi vs K20: The fight of the titans | SURFsara | Evghenii Gaburov
A real-time simulation technique for ship-ship and ship-port interaction | PMH bv | Jo Pinkster
CUDA Accelerated Neural Networks | LIACS | Ana Balevic
Efficient Reconstruction of Biological Networks via Transitive Reduction on GPUs | TU Eindhoven | Anton Wijs
Running Petsc on GPUs with an example from fluid dynamics | SURFsara | Thomas Geenen
Connected Component Labelling, an embarrassingly sequential algorithm | Leeuwarden University | Jaap van de Loosdrecht
Visualizing sound and vibrations using a GPU and a 1024-channel microphone array | TU Eindhoven | Wouter Ouwens
Gravitational N-body simulations on 1 to many GPUs | Leiden observatory | Jeroen Bédorf

A few demos will be shown.

For more information, and to find other events by the platform, see the Platform Parallel webpage.

Tickets are €75,-. If you are from a Dutch university or research institute affiliated with SURF, your ticket has been fully sponsored by SURFsara.

Associated events in the Netherlands

For the technical aspects (GPU-programming techniques, optimisation, etc) we have a special day: the GPU Dev Day 2013. More information on the Platform Parallel webpage. Date and place will be made public in June.

The first Khronos Meetup Benelux will take place just before the Applied GPGPU day, on 19 June in Amsterdam. More information on the meetup-page.

Basic concept: Hosts and devices

Time for some basic concepts of OpenCL. As I notice a growing number of visitors to this page, I also noticed I have actually not written much about coding and basics.

One of the first steps of an OpenCL program is selecting hosts and devices. If you program for a tablet, which has one chip and a screen, you don’t think of several devices. And if you log in on a server, your assumption is that there is one host, and that’s the one you logged into. If you have read my article about how to install all drivers on Ubuntu, you have gotten several clues. I added some tips & tricks, but not too many. If you know more about this subject yourself, please share it with others in the comments.

Continue reading “Basic concept: Hosts and devices”

DirectCompute’s unpopularity

In the world of GPGPU we currently have 4 players: Khronos OpenCL, NVIDIA CUDA, Microsoft DirectCompute and PathScale ENZO. You probably know CUDA and OpenCL already (or start reading more articles from this blog). ENZO is a 64bit-compiler which serves a small niche-market, and DirectCompute is built on top of CUDA/OpenCL or at least uses the same drivers.

Edit 2011-01-03: I was contacted by Pathscale about my conclusions about ENZO. The reason why not much is out there is that they’re still in closed alpha. Expect to hear more from them about ENZO somewhere in the coming 3 months.

A while ago there was an article introducing OpenCL by David Kanter, who claimed on page 4 that DirectCompute will win over CUDA. I quote:

Judging by history though, OpenCL and DirectCompute will eventually come to dominate the landscape, just as OpenGL and DirectX became the standards for graphics.

I twittered that I totally disagreed with him and in this article I will explain why I think that.

Continue reading “DirectCompute’s unpopularity”

What is OpenCL?

OpenCL (trademark of Apple Inc.) is an open, royalty-free industry standard that makes much faster computations possible. The standard is controlled by the non-profit standards organisation Khronos. By using this technique and graphics cards (GPUs) or extensions of modern processors you can, for example, convert a video in 20 minutes instead of 2 hours.

Programming the GPU used to be a very difficult task done by specialised teams and universities, but since 2010 it has been within reach of more companies.

Below is a video which explains the differences between single-core, multiple core (starting at 1:27) and OpenCL (starting at 2:32).

http://www.youtube.com/watch?v=IEWGTpsFtt8

You can read more about the engineering ins and outs of the standard at http://www.khronos.org/opencl/.

How OpenCL works

OpenCL is an extension to existing languages. It makes it possible to specify a piece of code that is executed multiple times independently from each other. This code can run on various processors – not only the main one. Also there is an extension for vectors (float2, short4, int8, long16, etc), because modern processors have support for that.

So for example you need to calculate Sin(x) of a large array of one million numbers. OpenCL detects which devices could compute this for you and gives some statistics of each device. You can pick the best device, or even several devices, and send the data to the device(s). Normally you would loop over the million numbers, but now you say something like: “Get me Sin(x) of each x in array A”. When finished, you take the data back from the device(s) and you are finished.

As the compute-devices can do more in parallel and OpenCL is better in describing independent functions, the total execution time is much lower than conventional methods.
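The "Sin(x) of each x in array A" idea can be sketched in plain Python. This is not OpenCL host code and it will not be faster than a loop; it only shows the programming model, in which the loop body becomes a small function that is run once per element and independently of all other elements. The names `sin_kernel`, `run_sequential` and `run_parallel` are invented for this sketch, and the OpenCL C kernel in `KERNEL_SOURCE` is shown for comparison only and is not compiled here.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# What the same operation could look like as an OpenCL C kernel (for reference only).
KERNEL_SOURCE = """
__kernel void sin_array(__global const float* a, __global float* out) {
    size_t i = get_global_id(0);   /* position of this work-item in the data */
    out[i] = sin(a[i]);
}
"""


def sin_kernel(global_id, a, out):
    """One work-item: reads one element and writes one element, nothing else."""
    out[global_id] = math.sin(a[global_id])


def run_sequential(a):
    """The conventional way: loop over all elements one by one."""
    out = [0.0] * len(a)
    for i in range(len(a)):
        sin_kernel(i, a, out)
    return out


def run_parallel(a, workers=4):
    """The data-parallel way: hand every index to a pool of workers.

    Because each work-item touches only its own index, the order of
    execution does not matter.
    """
    out = [0.0] * len(a)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all work-items and re-raises any exception.
        list(pool.map(lambda i: sin_kernel(i, a, out), range(len(a))))
    return out


if __name__ == "__main__":
    data = [i * 0.001 for i in range(10_000)]
    assert run_parallel(data) == run_sequential(data)
    print("both approaches give the same result")
```

On a real compute-device the scheduling of the work-items is done by the hardware and driver, which is where the speed-up comes from; the thread pool here merely stands in for that.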

5 questions on OpenCL

Q: Why is it so fast?
A: Because a lot of extra hands make less work, the hundreds of little processors on a graphics card being the extra hands. But cooperation with the main processor keeps being important to achieve maximum output.

Q: Does it work on any type of hardware?
A: As it is an open standard, it can work on any type of hardware that targets parallel execution. This can be a CPU, GPU, DSP or FPGA.

Q: How does it compare to OpenMP/MPI?
A: Where OpenMP and MPI try to split loops over threads/servers and are CPU-oriented, OpenCL focuses on making threads aware of their position in the data and on making use of processor-capabilities. There are several efforts to combine the two worlds.

Q: Does it replace C or C++?
A: No, it is an extension which integrates well with C, C++, Python, Java and more.

Q: How stable/mature is OpenCL?
A: Currently we have reached version 1.2 and the standard is 3 years old. OpenCL has many predecessors, so the technology is quite a bit older than 3 years.

What does Khronos have to offer besides OpenCL and OpenGL?

The OpenCL standard is from the not-for-profit industry consortium Khronos Group. But they do a lot more, like the famous standard OpenGL for graphics. The focus of the group has always been on multimedia and getting the fastest results out of the hardware.

Now that open source and open standards are getting more important, collaborations like the Khronos Group get more attention. At StreamHPC we are very happy with this trend, as the business models are more focused on collaborations and getting things done than on making sure the customer cannot ever leave.

Below is an overview of the most important APIs that Khronos has to offer.

OpenCL related

  • OpenCL: compute
  • WebCL: web compute
  • SPIR/SPIR-V: intermediate language for compute-kernels, like those of OpenCL and OpenGL’s GLSL
  • SYCL: high-level language for OpenCL

OpenGL related

  • Vulkan: state-less graphics
  • OpenGL: graphics
  • OpenGL ES: embedded graphics
  • WebGL: web graphics
  • glTF: runtime asset format for WebGL, OpenGL ES, and OpenGL
  • OpenGL SC: Graphics for Safety Critical operations
  • EGL: interface between rendering APIs such as OpenGL ES and the underlying native platform window system, such as X.

Streaming input and output

  • OpenMAX: interface for multimedia codecs, platforms and hardware
  • StreamInput: interface for sensors
  • OpenVX: OpenCV-alternative, built for performance.
  • OpenKCam: interface for cameras and sensors

Others

One video called “OpenRoad” to show them all:

http://www.youtube.com/watch?v=ckD0op6OgMQ

Want to learn more? Feel free to ask in the comments, or check out https://www.khronos.org/

Gedit OpenCL Syntax Highlighting

Update 17-06-2011: updated version of opencl.lang and added opencl_host.lang.

When learning a language it is nice to do it the hard way, so you take the default txt-file editor provided with your OS. No colours, no help, no nothing, pure hard-core learning. But in the Linux-desktop Gnome the default editor Gedit is quite powerful without doing too much, has an official Windows-port and has an OSX Darwin-port. It took just a few hours to understand how highlighting in Gedit works and to get it implemented. I got some nice help from the work done at the cuda-highlighter by Hüseyin Temucin (for showing how to extend the c-highlighter the best way) and the VIM OpenCL-highlighter by Terence Ou (for all the reserved words). This is work in progress; I will tell about updates via Twitter.

Get it

Windows-users first need to download Gedit for Windows. OSX-folks can check Darwin-ports. Then the files opencl.lang (.cl-files) and opencl_host.lang (extension of c to highlight OpenCL-keywords) need to be put in /usr/share/gtksourceview-2.0/language-specs/ (or in ~/.local/share/gtksourceview-2.0/language-specs/ for local usage only), or for Windows in C:\Program Files\gedit\share\gtksourceview-2.0\language-specs, or for OSX in /Applications/gedit.app/Contents/Resources/share/gtksourceview-2.0/language-specs/. Make sure all Gedit-windows are closed so the configuration will be re-read, and then open a .cl-file with Gedit. If you have opened cl-files as C or Cuda, you have to set the highlighting to OpenCL manually (under view -> highlighting). For host-code you always need to set the highlighting manually to “OpenCL host”. You might want to associate cl-files with Gedit.
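For the local-usage-only variant on Linux, copying the files can be scripted. Below is a minimal sketch in Python; the function name `install_language_specs` and its `home` parameter are made up for this example, and it assumes the two .lang files have already been downloaded to the current directory.

```python
import shutil
from pathlib import Path


def install_language_specs(lang_files, home=None):
    """Copy .lang files into the per-user gtksourceview-2.0 language-specs folder.

    `home` defaults to the current user's home directory; it is a parameter
    only so the function can be tried out on a scratch directory first.
    """
    home = Path(home) if home is not None else Path.home()
    target = home / ".local" / "share" / "gtksourceview-2.0" / "language-specs"
    target.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing

    installed = []
    for lang_file in lang_files:
        source = Path(lang_file)
        destination = target / source.name
        shutil.copy(source, destination)
        installed.append(destination)
    return installed


if __name__ == "__main__":
    wanted = [Path("opencl.lang"), Path("opencl_host.lang")]
    present = [f for f in wanted if f.exists()]
    for path in install_language_specs(present):
        print("installed", path)
```

As described above, close all Gedit-windows afterwards so the configuration is re-read.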

Alternatives

VIM: http://www.vim.org/scripts/script.php?script_id=3157

Notepad++: http://sourceforge.net/tracker/?func=detail&aid=2957794&group_id=95717&atid=612384

SciTE: http://forums.nvidia.com/index.php?showtopic=106156

StreamHPC is working on Eclipse-support and I’ve understood also work is done for Netbeans-support. Let me know if there are more alternatives.

OpenCL at SC15 – the booths to go to

This year we’re unfortunately not at SuperComputing 2015 for reasons you will hear later. But we haven’t forgotten about the people going and trying to find a share of OpenCL. Below is a list of companies having a booth at SC15, which was assembled by the guys of IWOCL and which we completed with some more background information.

Khronos

The first place to go to is booth #285 and meet Khronos to hear where to go at SC15 to see how OpenCL has risen over the years. More info here. Say hi from the StreamHPC team!

OpenCL on FPGAs

Altera | Booth: #462. Expected to have many demos on OpenCL. See their program here. They have brought several partners around the floor, all expecting to have OpenCL demos:

  • Reflex | Booth: #3115.
  • BittWare | Booth #3010.
  • Nallatech | Booth #1639.
  • Gidel | Booth #1937.

Xilinx | Booth: #381. Expected to show their latest advancements on OpenCL. See their program here.

Microsoft | Booth: #1319. Microsoft Bing is accelerated using Altera and OpenCL. Ask them for some great technical details.

ICHEC | Booth #2822. The Irish HPC centre works together with Xilinx using OpenCL.

Embedded OpenCL

ARM | Booth: #2015. Big on 64 bit processors with several partners on the floor. Interesting to ask them about the OpenCL-driver for the CPU and their latest MALI performance.

Huawei Enterprise | Booth: #173. Recently proudly showed the world their OpenCL capable camera-phones, using ARM MALI.

HPC OpenCL

Below are the three companies that promise at least 1 TFLOPS DP per co-processor.

Intel | Booth: #1333/1533. Where they spoke about OpenMP and forgot about OpenCL, Altera has brought them back. Maybe they share some plans about Xeon+FPGA, or OpenCL support for the new XeonPhi.

AMD | Booth: #727. HBM, HSA, Green500, HPC APU, 32GB GPUs and 2.2 TFLOPS performance – enough to talk about with them. Also lots of OpenCL love.

NVidia | Booth: #1021. Every year they have been quite funny when asked about why OpenCL is badly supported. Please do ask them this question again! Funniest answer wins something from us – to be decided.

Others

You’ll find OpenCL in many other places.

ArrayFire | Booth #2229. Their library has an OpenCL backend.

IBM | Booth: #522. Now that Altera has joined Intel, IBM’s OpenPower has been left with NVidia for accelerators. OpenCL could revive the initiative.

NEC | Booth: #313. The NEC group has accelerated PostgreSQL with OpenCL.

Send your photos and news!

Help us complete this post with news and photos. We’re sorry not to be there this year, so we need your help to make the OpenCL party complete. You can send them via email, twitter and in the comments below. Thanks in advance!

PDFs of Monday 29 August

This is the first PDF-Monday. It started as I used Mondays to read up on what happens around OpenCL and I like to share with you. It is a selection of what I find (somewhat) interesting – don’t hesitate to contact me on anything you want to know about accelerated software.

Parallel Programming Models for Real-Time Graphics. A presentation by Aaron Lefohn of Intel. Why a mix of data-, task-, and pipeline-parallel programming works better using hybrid computing (specifically Intel processors with the latest AVX and SSE extensions) than using GPGPU.

The Practical Reality of Heterogeneous Super Computing. A presentation by Rob Farber of NVidia on why discrete GPUs have a great future even if heterogeneous processors hit the market. Nice insights, as you can expect from the author of the latest CUDA-book.

Scalable Simulation of 3D Wave Propagation in Semi-Infinite Domains Using the Finite Difference Method (Thales Luis Rodrigues Sabino, Marcelo Zamith, Diego Brandâo, Anselmo Montenegro, Esteban Clua, Maurício Kischinhevksy, Regina C.P. Leal-Toledo, Otton T. Silveira Filho, André Bulcâo). GPU based cluster environment for the development of scalable solvers for a 3D wave propagation problem with finite difference methods. Focuses on scattering sound-waves for finding oil-fields.

Parallel Programming Concepts – GPU Computing (Frank Feinbube) A nice introduction to CUDA and OpenCL. They missed task-parallel programming on hybrid systems with OpenCL though.

Proposal for High Data Rate Processing and Analysis Initiative (HDRI). Interesting if you want to see a physics project where they have not decided yet whether to use GPGPU or a CPU-cluster.

Physis: An Implicitly Parallel Programming Model for Stencil Computations on Large-Scale GPU-Accelerated Supercomputers (Naoya Maruyama, Tatsuo Nomura, Kento Sato and Satoshi Matsuoka). A collection of macros for GPGPU, tested on TSUBAME2.

AMD ROCm 1.5 Linux driver-stack is out

ROCm is AMD’s open source Linux-driver that brings compute to HSA-hardware. It does not provide graphics and therefore focuses on monitor-less applications like machine learning, math, media processing, machine vision, large scale simulations and more.

For those who do not know HSA, the Heterogeneous System Architecture defines hardware and software such that different processor types (like CPU, GPU, DSP and FPGA) can seamlessly work together and have fine-grained memory sharing. Read more on HSA here.

About ROCm and its short history

The driver stack has been on Github for more than a year now. Development is done internally, while communication with users is done mostly via Github’s issue tracker. ROCm 1.0 was publicly announced on 25 April 2016. After version 1.0, there now have been 6 releases in only one year – the 4 months of waiting time between 1.4 and 1.5 was therefore relatively long. You can certainly say the development is done at a high pace.

ROCm 1.4 was released end of December and besides a long list of fixed bugs, it had the developer preview of OpenCL 2.0 kernel support added. Support for OpenCL was limited to Fiji (R9 Fury series) and Baffin/Ellesmere (Radeon RX 400 series) GPUs, as these have the best HSA support of current GPU offerings.

Currently not all parts of the driver stack are open source, but the binary blobs will be open sourced eventually. You might wonder why a big corporation like AMD would open source such an important part of their offering. This makes total sense if you understand that their most important customers spend a lot of time on making the drivers and their code work together. By giving access to the code, debugging becomes a lot easier and development time is reduced. This will result in fewer bugs and a shorter time-to-market for the AMD-version of the software.

The OpenCL language runtime and compiler will be open sourced soon, so AMD offers full OpenCL without any binary blob.

What does ROCm 1.5 bring?

Version 1.5 adds improved support for OpenCL, where 1.4 only gave a developer preview. Both feature-support and performance have been improved. Just like in 1.4 there is support for OpenCL 2.0 kernels and OpenCL 1.2 host-code – the tool clinfo mentions there is even some support of 2.1 kernels, but we haven’t fully tested this yet.

The command-line based administration (ROCm-SMI) adds power monitoring, so power-efficiency can be measured.
The HCC compiler was upgraded to the latest CLANG/LLVM. There also have been big improvements in C++ compatibility.

Other improvements:

  1. Added new API hipHccModuleLaunchKernel, which works exactly like hipModuleLaunchKernel but takes launch parameters of the OpenCL programming model. A test for it was added as well.
  2. Added new API hipMemPtrGetInfo
  3. Added new field to hipDeviceProp_t -> gcnArch which returns 803, 700, 900, etc.
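To relate the gcnArch values in the list above to the GFX7/GFX8 naming used further down, a small helper could look like the Python sketch below. It rests on an assumption that is not stated in the release notes: that the hundreds digit of gcnArch is the major GFX generation (803 for GFX8, 700 for GFX7, 900 for GFX9). The function name is made up for this example.

```python
def gfx_generation(gcn_arch: int) -> str:
    """Turn a gcnArch value such as 803 into a family label such as 'GFX8'.

    Assumption: the hundreds digit encodes the major generation.
    """
    if gcn_arch < 100:
        raise ValueError(f"unexpected gcnArch value: {gcn_arch}")
    return f"GFX{gcn_arch // 100}"


if __name__ == "__main__":
    for arch in (700, 803, 900):
        print(arch, "->", gfx_generation(arch))
```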

Bug fixes:

  1. Fixed Copyright and header names
  2. Fixed issue with bit_extract sample
  3. Enabled lgamma and lgammaf
  4. Added guard for GFX8 specific intrinsics
  5. Fixed a few issues with operator overloading of vector data types
  6. Fixed atanf
  7. Added guard for __half data types to work with clang version more than 3. (Will be removed eventually).
  8. Fixed 4_shfl to work only for gfx803 as hawaii doesn’t support permute ops

Current hardware support:

  • GFX7: Radeon R9 290 (4 GB), Radeon R9 290X (8 GB), Radeon R9 390 (8 GB), Radeon R9 390X (8 GB), FirePro W9100 (16 GB), FirePro S9150 (16 GB), and FirePro S9170 (32 GB).
  • GFX8: Radeon RX 480, Radeon RX 470, Radeon RX 460, Radeon R9 Nano, Radeon R9 Fury, Radeon R9 Fury X, Radeon Pro WX7100, Radeon Pro WX5100, Radeon Pro WX4100, and FirePro S9300 x2.

If you’re buying new hardware, pick a GPU from the GFX8 list. FirePro S9300 X2 is currently the server-grade solution of choice.

Keep an eye on the Phoronix website, which is usually first with benchmarking AMD’s open source drivers.

Install ROCm 1.5

Where 1.4 had support for Ubuntu 14.04, Ubuntu 16.04 and Fedora 23, 1.5 added support for Fedora 24 and dropped support for Ubuntu 14.04 and Fedora 23. On other distributions than Ubuntu 16.04 or Fedora 24 it *could* work, but there are zero guarantees.

Follow the instructions on Github step-by-step to get it installed via deb or rpm. Be sure to uninstall any previous release of ROCm to avoid problems.

The part on Grub might not be clear. For this release the magic GRUB_DEFAULT line on Ubuntu 16.04 is:

GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 4.9.0-kfd-compute-rocm-rel-1.5-76"

You need to alter this line with every update, else it’ll keep using the old version.
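Since the line has to be changed with every update, it can help to generate it from the kernel version string. A minimal sketch in Python follows; the function name and parameters are made up for this example, and the menu layout is simply copied from the Ubuntu 16.04 line above, so other distributions may need a different format.

```python
def grub_default_line(kernel_version: str, distro: str = "Ubuntu") -> str:
    """Build the GRUB_DEFAULT line for a given ROCm kernel version.

    The 'Advanced options for ...>..., with Linux ...' layout is taken from
    the Ubuntu 16.04 example in this post.
    """
    menu_entry = f"Advanced options for {distro}>{distro}, with Linux {kernel_version}"
    return f'GRUB_DEFAULT="{menu_entry}"'


if __name__ == "__main__":
    # Kernel version as used for ROCm 1.5 in this post.
    print(grub_default_line("4.9.0-kfd-compute-rocm-rel-1.5-76"))
```

The printed line goes into the Grub configuration as described in the installation instructions on Github.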

Make sure “/opt/rocm/bin/” is in your PATH when wanting to do some coding. When running the test, you should get:

/opt/rocm/hsa/sample$ sudo make
gcc -c -I/opt/rocm/include -o vector_copy.o vector_copy.c -std=c99
gcc -Wl,--unresolved-symbols=ignore-in-shared-libs vector_copy.o -L/opt/rocm/lib -lhsa-runtime64 -o vector_copy
/opt/rocm/hsa/sample$ ./vector_copy
Initializing the hsa runtime succeeded.
Checking finalizer 1.0 extension support succeeded.
Generating function table for finalizer succeeded.
Getting a gpu agent succeeded.
Querying the agent name succeeded.
The agent name is gfx803.
Querying the agent maximum queue size succeeded.
The maximum queue size is 131072.
Creating the queue succeeded.
"Obtaining machine model" succeeded.
"Getting agent profile" succeeded.
Create the program succeeded.
Adding the brig module to the program succeeded.
Query the agents isa succeeded.
Finalizing the program succeeded.
Destroying the program succeeded.
Create the executable succeeded.
Loading the code object succeeded.
Freeze the executable succeeded.
Extract the symbol from the executable succeeded.
Extracting the symbol from the executable succeeded.
Extracting the kernarg segment size from the executable succeeded.
Extracting the group segment size from the executable succeeded.
Extracting the private segment from the executable succeeded.
Creating a HSA signal succeeded.
Finding a fine grained memory region succeeded.
Allocating argument memory for input parameter succeeded.
Allocating argument memory for output parameter succeeded.
Finding a kernarg memory region succeeded.
Allocating kernel argument memory buffer succeeded.
Dispatching the kernel succeeded.
Passed validation.
Freeing kernel argument memory buffer succeeded.
Destroying the signal succeeded.
Destroying the executable succeeded.
Destroying the code object succeeded.
Destroying the queue succeeded.
Freeing in argument memory buffer succeeded.
Freeing out argument memory buffer succeeded.
Shutting down the runtime succeeded.

Also clinfo (installed from the default repo) should work.
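Checking the PATH and the long output of vector_copy by eye is error-prone, so here is a small Python sketch that does both. The function names are made up for this example. It assumes that every line of the sample's own output either ends with "succeeded.", starts with "The " (the informational lines) or is "Passed validation.", as in the listing above; what a failing run prints exactly is not known here, so anything else is just reported as unexpected.

```python
def rocm_bin_in_path(path_value: str, rocm_bin: str = "/opt/rocm/bin") -> bool:
    """Check whether the ROCm bin directory is part of a PATH string (Linux, ':'-separated)."""
    entries = [entry.rstrip("/") for entry in path_value.split(":")]
    return rocm_bin.rstrip("/") in entries


def unexpected_lines(output: str):
    """Return the lines of the vector_copy output that do not look like success."""
    suspicious = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        looks_ok = line.endswith("succeeded.") or line == "Passed validation."
        informational = line.startswith("The ")  # e.g. "The agent name is gfx803."
        if not (looks_ok or informational):
            suspicious.append(line)
    return suspicious


def sample_passed(output: str) -> bool:
    """True if validation passed and no unexpected line was printed."""
    lines = [line.strip() for line in output.splitlines()]
    return "Passed validation." in lines and not unexpected_lines(output)


if __name__ == "__main__":
    import os
    import sys

    print("ROCm in PATH:", rocm_bin_in_path(os.environ.get("PATH", "")))
    if not sys.stdin.isatty():
        # Usage: ./vector_copy | python3 check_rocm.py   (script name is hypothetical)
        print("Sample passed:", sample_passed(sys.stdin.read()))
```

Only pipe the output of ./vector_copy itself into it; the make and gcc lines from the listing would be flagged as unexpected.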

Got it installed and tried your code? Did you see improvements? Share your experiences in the comments!

Not really ROCk-music, but this blog has been written while listening to the latest album of the Gorillaz

OpenCL Potentials: Investment-industry

This is the second in the series “OpenCL potentials”. I chose this industry because it is the finest example of a field where you are always late, even if you were first. So it must always be faster if you want to make better analyses. Before I started StreamHPC I worked for an investment-company, and one of the things I did was reverse engineering a few megabytes of code with the primary purpose of updating the documentation. I then made a proof-of-concept to show the data-processing could be accelerated by a factor of 250-300 using Java-tricks only and no GPGPU. That was the moment I started to understand that real-time data-computation was certainly possible, and also that IO is the next bottleneck after computational power. Though I am more interested in other types of research, I do have my background and therefore try to give an overview for this sector and why it matters.

Continue reading “OpenCL Potentials: Investment-industry”

A list of Desktop GPU architectures

UPDATED in February 2017

Some optimisation tricks work really well on one architecture, and are useless on others. And even with better drivers, the older architectures need some help. In other words, it helps to know what architecture the GPU has. Therefore you get some help from your friends at StreamHPC.

Below you’ll find a list of the architecture names of all OpenCL-capable GPU models of Intel, NVIDIA and AMD. It does not contain the professional lines for now – first we are focusing on getting the general models right.

Please understand that it took a lot of time to gather the information below, and that we normally share such information only with our clients.

Continue reading “A list of Desktop GPU architectures”

Xeon Phi Knights Corner compatible workstation motherboards

Intel has assumed a lot when it comes to the XeonPhi. One assumption was that you will use it on dual-Xeon servers or workstations and that you already have a professional supplier of motherboards and other computer-parts. We can only guess why they’re not supporting non-professional enthusiasts who got the cheap XeonPhi.

After browsing half the internet to find an overview of motherboards, I eventually emailed Gigabyte, Asus and ASRock for more information on a desktop-motherboard that supports the blue thing. With the information I got, I could populate the list below. As usual, we share our findings with you.

A quote that applies here: “The main reason business grade computer supplies can be sold at a higher price is that the customers don’t know what they’re buying”. When I first heard it, I did not know why the customer is not well-informed – now I do.

Continue reading “Xeon Phi Knights Corner compatible workstation motherboards”

OpenCL error codes (1.x and 2.x)

Little Britain: “Compu’er says no”. (links to Youtube movie)

Knowing all errors by heart is good for quick programming, but not always the best option. Therefore I started to create a full list with extra info, taken from cl.h and the reference documentation.

The problem with many error-codes is that they are sometimes context-dependent and then become quite useless in helping the programmer out. Also some drivers return different error-codes. Notice also that different errors are given per OpenCL-version for the same function. If you find problems, help make OpenCL better and give feedback.
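
As a small illustration of what such a list makes possible, here is a minimal Python sketch that turns a numeric code into its name from cl.h. The helper name `describe_cl_error` and the fallback text are hypothetical, and the table is deliberately partial: it only holds a handful of codes, so treat it as a starting point rather than a complete reference.

```python
# A few OpenCL error codes with their names as defined in cl.h.
# Deliberately partial: extend it from cl.h for real use.
CL_ERROR_NAMES = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -11: "CL_BUILD_PROGRAM_FAILURE",
    -30: "CL_INVALID_VALUE",
    -34: "CL_INVALID_CONTEXT",
}


def describe_cl_error(code: int) -> str:
    """Return the cl.h name for a code, or a generic fallback (hypothetical helper)."""
    return CL_ERROR_NAMES.get(code, f"unknown OpenCL error ({code})")


if __name__ == "__main__":
    # Print a few known codes and one that is not in the table.
    for code in (0, -5, -11, -9999):
        print(code, describe_cl_error(code))
```

Because the meaning of a code can depend on the function that returned it, a name lookup like this only tells you what the driver said, not why it said it.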

Want it on your wall? You can easily copy these two tables into Excel or alike software and print it out.

Continue reading “OpenCL error codes (1.x and 2.x)”

Our offices

We’re expanding to more cities, to be closer to talent and our customers. The idea is to have multiple smaller offices instead of a few big ones. It came from a simple set of questions on what work would look like in 2030. The lines between offices will be shifting – not everything is defined by walls. So smaller offices nearby, with the flexibility to temporarily move to another city, are much better suited to what is expected in 2030.

Each city has one or two senior developers who also act as managers and take the lead when the project-complexity demands it.

In HQ the main structure is provided for onboarding, administration, sales and such. All to make sure the different cities only have a few local things to take care of, so the focus can be on building great software and efficiently handling the projects.

EU – NL – Amsterdam

Koningin Wilhelminaplein 1 – 40601, 1062HG, Amsterdam, Netherlands

Amsterdam is the economic center of the Netherlands, a small country with 17 million inhabitants. It’s the home of HPC-companies like Bright Computing and ClusterVision, and has a large IT workforce that also feeds the R&D demand of large international companies. As the number of companies settling here is still growing, Amsterdam is even planning to build a completely new city for 40 to 70 thousand people in the harbour area.

There are different sides to the city. When you think of Amsterdam as a tourist, you might think of the Anne Frank House, Gay Parade, Van Gogh Museum, the Red Light District, the canals, windmills and tulips. If you consider living here, think of the roughly 180 different nationalities that live in the city, the 22 international schools and two universities, the vibrant night life and the many-villages-make-the-city atmosphere. Locals of all professions are fluent in English and there is a lively expat community.

You don’t need to live in Amsterdam, as there are several cities and villages nearby, each with its own identity. As the Dutch infrastructure is of a high standard, Amsterdam is easy to reach by train (and car) from these places. For instance, taking the train from Haarlem to the office takes 9 to 13 minutes, and from Leiden or Utrecht half an hour. Want to live by the sea? Zandvoort to the office is 25 minutes.

Expats (both single and with family) say they found it easy to build up a social life. For Europeans it’s very easy to move to Amsterdam, as there are no real borders in the EU.

EU – HU – Budapest

Radnóti Miklós u. 2, Budapest, 1137, Hungary

Two cities, Buda and Pest, each with their own character, form the capital of Hungary, which has 1.75 million inhabitants and is the ninth-largest city in the EU. The country (est. in 895) has almost 10 million inhabitants.

There is more high-tech industry than you might think. Hungary has one of the highest rates of filed patents, the 6th highest ratio of high-tech and medium high-tech output in the total industrial output, the 12th-highest research Foreign Direct Investment inflow, placed 14th in research talent in business enterprise and has the 17th-best overall innovation efficiency ratio in the world.

If you walk in the city, you’ll find no average Hungarian. There is much creativity hidden and there’s a rich beer-culture. There is this unique quiet vibrant atmosphere that makes you immediately feel at home.

EU – ES – Barcelona

Better weather during winter than in Amsterdam and Budapest, and a vibrant tech-city. It hosts the famous Barcelona Supercomputing Center and is a strong tech-hub.

Contenders

We’re researching multiple cities for starting a new office. Due to Covid this research has been delayed a lot.

  • EU – NL – Utrecht
  • EU – NL – Eindhoven
  • EU – PL – Warsaw
  • EU – FR – Paris
  • EU – FR – Grenoble
  • EU – DE – Heidelberg
  • UK – Bristol

If you live in one of these cities and are good with GPUs, do get in contact. We start a new office with these people:

  • An experienced developer who can manage projects
  • Three to four mid-level/senior developers
  • A temporary “location starter”
  • Optionally a sales-person

AMD vs NVIDIA – Two figures that can tell a whole story

Update September ’13: AMD gets their new GPUs “Volcanic Islands” with GCN 2.0 out in October. For this reason the HD 7970’s price has dropped to €250. This shakes up some of the things described in this article.

Update June ’14: It has become clear that Titan is not a consumer device and should be categorised as a “Quadro for compute”. All consumer devices of both AMD and NVIDIA show relatively low GFLOPS for double precision.

Update July ’14: Graphs updated with GTX Titan Z and R9 290X.

AMD/ATI has always had the fastest GPU out there. Yes, there were lots of times in which NVIDIA approached the throne, or even held the crown for a while (at least theoretically), but it was Radeon, at the end, the one who had the right claim.

Nevertheless, some things have changed:

  • AMD has focused more on the new architecture, making it easier to program while keeping the GFLOPS the same.
  • AMD bets on their A-series APU with integrated GPU.
  • NVIDIA has increased both memory bandwidth and GFLOPS at a steady pace.
  • NVIDIA has done the nitro-trick for double precision.

With NVIDIA GTX Titan (see three of them in the image), NVIDIA snatched victory from the jaws of defeat.

I’m not saying you should jump to CUDA now; there’s more than just GFLOPS. We should also think of costs and the prevention of vendor lock-in. More particularly, I would like to show how unpredictable the market for accelerator-processors is.

Let’s take a look at the figures.

Continue reading “AMD vs NVIDIA – Two figures that can tell a whole story”

Company History

There are not many companies like Stream HPC in Europe. Most others are either a government-institute for the national supercomputer, freelancers, or not actually experienced with GPUs. So how did it start?

2009: The bore-out

Stream’s founder Vincent Hindriksen had to maintain a piece of software that was often failing to process the daily reports. After documenting the internals and algorithms of the code by interviewing the key people and some reverse engineering, it was a lot easier to create effective solutions for the bugs within the software. After fixing a handful of bugs, there was simply a lot less to do except reading books and playing online games.

To avoid becoming a master in Sudoku, he spent the following three weeks rewriting all the code, using the freshly produced documentation. The 2.5 hours needed to process the data were reduced to 19 seconds – yes, the kick from performance optimization was already there. For some reason it took well over 6 months to port the proof-of-concept, which was simply unbearable as somebody had to make sure the old code was maintained for 40 hours a week. As he was the only one who understood the code, there was no option to get placed on another project.

This ended in a bore-out: not wanting to go to work anymore. It’s actually quite the same as a burn-out, but with a different cause.

2010: a new start

From that bore-out the company was born the next year. There were two options: GPGPU (mostly OpenCL, a hobby) or building smart products for public transport. Two domains were bought, and the choice was made during the year. For public transport a proof-of-concept was made, but the choice fell on the really difficult work.

Not much money was earned that year. Even government-support had to be paid back as one invoice was sent 2 weeks too early.

2011-2013: What’s a GPU?

We now have a clear idea of what GPUs can do, but in 2010-2014 GPUs were still seen as being for graphics only and sales were very difficult. Selling to somebody who states “GGGGGGraphics Processing Unit” is quite difficult.

A loan of €4000 by Vincent’s grandmother, a landlord who was relaxed with payments, the trust in the technology by early customers, and late payments to our creditors got us through.

2014: Employee #1

There was still no stable income. Sales and marketing also took a lot of time, hurting the time that could be spent on actual work. But slowly we got more traction – more people started to believe in the company’s vision.

By the end of the year the first employee, Anca, was hired.

As the choice was to build a services company instead of a products company, banks and investors were not even interested in providing financial support. We can now say that for the long term this was the best – we can now fully control our own strategies and invest in our own product development.

2015-2020: First growth phase

Growth is hard, really hard. And we learned that, well, the hard way. Several decisions would now be made differently, but we adapted and continued. Some examples:

  • Investing in FPGAs too early. OpenCL-on-FPGAs was the next big thing, so based on what we got promised by vendors, we made the same promises to our customers. Many promises did not turn into reality.
  • Hiring the wrong people. Or: hiring people for whom we are the wrong company, as it goes both ways. We now define our culture, because we want people who fit our culture.
  • All the other things that are in the books under “early stage growth”.

By 2021 we had got past the growing pains and went into the second phase.

2021: The second office

The first choice was actually in Belgium, because it was closer to Amsterdam. Unfortunately that project did not succeed. By coincidence we got into Budapest, and outgrew the office space in the first year.