Call for speakers: IEEE eScience Conference in Amsterdam

Reading Time: 2 minutes

We’re on the program committee of the 14th IEEE eScience Conference in Amsterdam, organized by the Netherlands eScience Center. It will be held from 29 October to 1 November 2018, and the deadline for submitting abstracts is Monday 18 June.

The conference brings together leading international researchers and research software engineers from all disciplines to present and discuss how digital technology impacts scientific practice. eScience promotes innovation in collaborative, computationally- or data-intensive research across all disciplines, throughout the research lifecycle.

We invite researchers to submit their:

  • state-of-the-art computer science work, demonstrating its application in one or more scientific disciplines
  • research that shows the impact of digital technology on any scientific discipline (not excluding the disciplines covered in the focused sessions)

Disciplines include, but are not limited to: life sciences, health research, forensic science, humanities, social sciences, physics, astronomy, climate, environmental science, and earth science. Digital technologies include, but are not limited to: machine learning, natural language processing, real-time data analysis, interoperability and linked data, multi-scale and multi-model simulations, high performance computing, workflow technologies, visualization, and image processing.

Researchers can submit:

  • A full paper to be eligible for an oral presentation. Accepted oral presentations will be assigned to one of the general eScience sessions on either the multi-track day or the single-track days.
  • A one-page abstract to be eligible for a poster presentation on Wednesday.

Both full papers and abstracts are peer reviewed and, if accepted, published in the conference proceedings. Rejected full papers will be considered for a poster presentation.

It is a requirement that at least one author of each accepted paper or abstract attends the conference.

Key Dates

  • Abstract submission deadline: Monday 18 June 23:59 (AoE)
  • Full paper submission deadline: Monday 18 June 23:59 (AoE)
  • Notification of acceptance: Wednesday 15 August 2018
  • Early bird registration deadline: Friday 31 August 2018
  • Camera-ready papers: Monday 17 September 2018
  • IEEE eScience Conference: Monday 29 October – Thursday 1 November 2018

Submission guidelines

Authors are invited to submit full conference papers (up to 10 pages excl. references) for an oral presentation using the IEEE 8.5 × 11 manuscript guidelines: double-column text using single-spaced 10-point font on 8.5 × 11 inch pages. Templates are available from

Authors are invited to submit a one-page abstract for a poster presentation using the same template, but limited to one page including figures and tables, excluding references.

Contributions should be submitted in PDF format to easychair:

Looking forward to reading your submission for the 14th IEEE eScience Conference and to meeting you here in Amsterdam!

Do you want to join StreamHPC?

Reading Time: 1 minute

As of this month, Stream has existed for 8 years. 8 full years of helping our customers with fast software. In Chinese numerology 8 is a very lucky number, and we notice that.

Over the years we’ve kept our focus on quality, and that was a good decision. The only problem is that we don’t have enough time to write on the blog, organise events or even send the “monthly” newsletter. With over 200 drafts for the blog (subjects that really should be shared), we need extra people to help us out.

Dear developers who are good with C, C++, OpenCL/CUDA and algorithms: please take a look at the following vacancies. I know you frequent our blog.

We’re also seeking an all-rounder to support daily operations, which includes management, customer contact, team support, etc.

See below for more details.

    We’re looking forward to your application! We accept both remote and Amsterdam-based candidates.

    Selecting Applications Suitable for Porting to the GPU

    Reading Time: 5 minutes

    Assessing software is never comparing apples to apples

    The goal of this article is to explain which applications are suitable for porting to OpenCL and running on a GPU (or multiple GPUs). We do this by showing the main differences between a GPU and a CPU, and by listing features and characteristics of problems and algorithms that can make use of the highly parallel architecture of a GPU and simply run faster on graphics cards. Additionally, there is a list of issues that can decrease the potential speed-up.

    It does not try to be complete, but focuses on the most essential parts of assessing whether code is a good candidate for porting to the GPU.

    GPU vs CPU

    The biggest difference between a GPU and a CPU is how they process tasks, due to their different purposes. A CPU has a few (usually 4 or 8, but up to 32) “fat” cores optimized for sequential serial processing, like running an operating system, Microsoft Word or a web browser, while a GPU has thousands of “thin” cores designed to be very efficient when running hundreds of thousands of alike tasks simultaneously.

    A CPU is very good at multi-tasking, whereas a GPU is very good at repetitive tasks. GPUs offer much more raw computational power than CPUs, but they would completely fail at running an operating system. Compare this to 4 motorcycles (CPU) or 1 truck (GPU) delivering goods – when the goods have to be delivered to customers throughout the city, the motorcycles win; when all goods have to be delivered to a few supermarkets, the truck wins.

    Most problems need both processors to deliver the best combination of system performance, price, and power. The GPU does the heavy lifting (the truck brings goods to distribution centers) and the CPU does the flexible part of the job (the motorcycles do the deliveries).

    Assessing software for GPU-porting fitness

    Software that does not meet its performance requirement (time taken / time available) is always a potential candidate for being ported to a GPU. Continue reading “Selecting Applications Suitable for Porting to the GPU”

    DOI: Digital attachments for Scientific Papers

    Reading Time: 3 minutes

    Ever seen a claim in a paper that you disagreed with or were triggered by, and then wanted to reproduce the experiment? Good luck finding the code and the data used in the experiments.

    When we want to redo the experiments of a paper, it starts with finding the code and data used. A good place to start is GitHub or the homepage of the scientist; GitLab, Bitbucket, SourceForge or the personal homepage of one of the researchers could also be places to look. Emailing the authors is often only an option if the university homepage mentions one – we’re not surprised to get no reaction at all. If all that doesn’t work, then implementing the pseudo-code and creating your own data might be the only option – and who knows if that will support the claims.

    So what if scientific papers had an easy way to connect to digital objects like code and data?

    Here the DOI comes in.

    Continue reading “DOI: Digital attachments for Scientific Papers”

    Learn about AMD’s PRNG library we developed: rocRAND – includes benchmarks

    Reading Time: 3 minutes

    When CUDA kept its dominance over OpenCL, AMD introduced HIP – a programming language that closely resembles CUDA. Now it doesn’t take months to port code to AMD hardware: more and more CUDA software converts to HIP without problems. Even the really large and complex code-bases take a few weeks at most, and we found that the problems solved along the way often made the CUDA code run faster too.

    The only problem is that the CUDA libraries need HIP equivalents before all CUDA software can be ported.

    Here is where we come in. We helped AMD make a high-performance Pseudo-Random Number Generator (PRNG) library, called rocRAND. Random number generation is important in many fields, from finance (Monte Carlo simulations) to cryptography, and from procedural generation in games to providing white noise. For some applications it’s enough to have some data, but for large simulations the PRNG is the limiting factor.

    The library provides the most used PRNGs and QRNGs (Quasi-RNGs), based on what we found on GitHub. Several of them you can also find in cuRAND:

    • XORWOW
    • MRG32k3a
    • Mersenne Twister for Graphic Processors (MTGP32)
    • Philox (4×32, 10 rounds)
    • Sobol32

    If you’re familiar with PRNGs, you’ll see that each of the most important families of generators has a representative. Now it’s easy to port software that uses cuRAND. But that’s not all.

    rocRAND is faster than cuRAND in most cases

    rocRAND works on NVidia hardware too. And in most cases it’s faster than cuRAND.

    Here we compare rocRAND generating normally-distributed floats on the AMD Radeon Nano, rocRAND on the GTX 1080 and cuRAND on the GTX 1080. The professional-grade GPUs, like the AMD MI25, are much faster – but this is just to show that the library written for AMD GPUs is faster than NVidia’s own library.


    This is before the optimization-phase on AMD R6 Nano and Nvidia GTX1080 – rocRAND on par with cuRAND.

    This is after the optimizations, where AMD GPUs get the upper hand due to higher bandwidth memory:

    As you can see, it’s preferable to also use the library for NVidia-only projects.

    Doing your own benchmarks

    On rocRAND’s GitHub page you’ll find instructions to benchmark the library on your own hardware. Do know that the library has been tuned for all recent AMD GPUs and Nvidia GTX GPUs, not Tesla GPUs. The code also does not work on CPUs or Intel GPUs.

    More on random numbers on our blog

    Want to know more about Random numbers? We wrote about the subject before.

    Random Numbers in Parallel Computing: Generation and Reproducibility (Part 1)

    Random Numbers in Parallel Computing: Generation and Reproducibility (Part 2)

    Porting code that uses random numbers

    Need a tailored RNG?

    When you know the exact restrictions you have for your project, we can:

    • further tune the library to be even faster, or
    • add special characteristics (e.g. less cyclic), or
    • port other PRNGs to the GPU.

    We did not put these hacks in the official code, as we then could not guarantee correct output for generic use. In case you need an RNG tailored to your specific needs, we are the team that can build it.

    Get in touch with the GPU Library Specialists today.

    GPU and FPGA challenge for MSc and PhD students

    Reading Time: 3 minutes

    While going through my email, I found out about the third “HiPEAC Student Heterogeneous Programming Challenge”. Unfortunately the deadline was last week, but I just got an email: if you register by this weekend (17 September), you can still join.

    Have you just started your MSc/PhD in the area of compilation and computer architecture? Then you have probably heard that heterogeneous systems are the future and the only way to ensure increasing computing performance. Have you already programmed such a system? Excellent, you are just the candidate we seek! You have not? No problem, this will be an excellent opportunity to familiarise yourself with them!


    Following the continuing success of the Heterogeneous Programming Challenge at the Computing Systems Week in Zagreb earlier this year, we are organising another follow-up event at the upcoming HiPEAC Computing Systems Week in Stuttgart (25-27th Oct). We will again provide you with a problem, and your task will be to make it run as fast as you possibly can! How? That is entirely up to you! No restrictions will be imposed! What architecture? Again, up to you! GPU, that is fine! FPGA, go for it! Multicore APU system with dedicated GPU and FPGA board? Now we’re talking! You have a heterogeneous programming framework? Now is the time to test it out! The only restriction is: this is a student-only event (Professors and PostDocs are very welcome to advise, but team members can only be students).


    This session’s problem has been selected in cooperation with Samsung: the Five-Point Relative Pose Problem [see this PDF]. Samsung uses this problem as part of a system that calculates the camera position in a 360-degree video environment. Thus, this is not just a nice academic problem, but one that actually matters in real life! The important performance measure for this algorithm is the obtained frequency, i.e. how many 5-point pose estimations can you perform per second? But other metrics could be of interest as well, e.g. energy consumption, memory consumption, etc.


    There are reference implementations available in OpenCV and OpenGV, so why not compare the performance of your implementation against them? In addition, Tommaso Maestri from Samsung might be able to help you with further questions.


    You first need to investigate the problem and the existing implementations. After that, everything is up to you. Choose an algorithm. Choose a platform. Choose an implementation. Choose a framework. After you have optimised everything there is to optimise, we would like you to prepare a short presentation explaining your strategy, the obstacles you overcame, and how fast it now runs. We plan to present the results at the upcoming Computing Systems Week in Stuttgart (pending approval by HiPEAC). There you will be able to discuss your results with HiPEAC students from all over Europe. In addition, we hope to select a panel of senior HiPEAC members from industry and academia to provide you with feedback on your approach.


    In order to give every HiPEAC university a chance to present their results, we would like to ask you to create only up to two teams per university and select a leader that will present the result (either in person or via a video link). Each team is able to implement as many different approaches as they like and present them in one short presentation.


    Please contact us, if you would like to form more than two teams per institution or a team with members from different institutions.


    Similar to previous events, we hope that we can provide some financial assistance for team leaders to attend the event in Stuttgart.


    Important dates:

    • 17 Sep: Register your team by emailing the team name, member names and emails to the organisers.
    • Wed, 25 Oct, 14:00 – 15:30: present your results in Stuttgart


    Questions? Please contact one of the organisers: Chris Fensch, Marisa Gil, or Georgios Goumas.

    If you register, let us know. If you win, let us know too.

    The single-core, multi-core and many-core CPU

    Reading Time: 3 minutes

    Multi-core CPU from 2011

    CPUs can now be split into three types, depending on the number of cores: single (1), multi (2-8) and many (10+).

    I find it more important now to distinguish these three types, as the types of problems to be solved by each are very different. Based on these problem differences, I even expect that the gap in core count between multi-core CPUs and many-core CPUs will grow.

    Below, the three types of CPUs are discussed, along with a short discussion of the many-core processors we see around. Continue reading “The single-core, multi-core and many-core CPU”

    HPC centre EPCC says: “Better software, better science”

    Reading Time: 2 minutes

    The University of Edinburgh houses the HPC centre EPCC. Neelofer Banglawala wrote about a programme which funds the development and improvement of scientific software, and also discussed the results.

    Many of the 10 most used application codes on ARCHER have been the focus of an eCSE project. Software with more modest user bases has improved user uptake and widened its impact through eCSE-funded work. Furthermore, performance improvements can lead to tens of thousands of pounds of savings in compute time.

    Saving tens of thousands of pounds is certainly worth the investment. This also means more users can work on the same supercomputer, thus reducing waiting times.

    Another improvement was seen in scalability. Making software scalable not only improves performance, but also makes it possible to increase the size of the problems it can work on.

    For example, EPCC and the University of Hull moved VOX-FE, a Voxel-based Finite Element bone modelling suite, from a local desktop to ARCHER, dramatically improving its performance and functionality. VOX-FE can now analyse very large, high resolution models with accurate geometry. This, together with its new adaptive remodelling functionality, could make VOX-FE a novel way for paleobiologists to carry out in silico reconstruction experiments of partially recovered bone from dinosaurs and other fossils.

    Making new science possible using the same software is also worth the investment.

    Better software = higher work-efficiency

    We’re happy that the EPCC reached the same conclusions as many of our customers, and spoke about it publicly. Better software not only can improve science, but can improve most industries, if not all.

    Why is this? Our dependency on software is so high that it directly influences work-efficiency. Improved software therefore improves work-efficiency and makes for happier employees. Happier and more efficient employees reduce employee costs. So you shouldn’t be surprised that the return-on-investment is often less than a year.

    Do you have software that stalls your employees’ progress or limits the problems they can tackle? Get in contact and we’ll discuss what can be done.

    Demo: cartoonizer on an Altera Arria 10 FPGA

    Reading Time: 2 minutes

    It takes quite some effort to program FPGAs using VHDL or Verilog. For several years now, Intel/Altera has had OpenCL drivers, with the goal of reducing this effort. OpenCL-on-FPGAs reduced the required effort to a quarter of the time, while also making it easier to alter the specifications during the project. Exactly the latter was very beneficial when creating this demo, as the to-be-solved problem was vaguely defined. The goal was to make a video look like a cartoon using image filters. We soon found out that “cartoonized” is a vague description, and it took several iterations to get the right balance between blur, color-reduction and edge-detection. Continue reading “Demo: cartoonizer on an Altera Arria 10 FPGA”

    CPU Code modernisation – our hidden expertise

    Reading Time: 2 minutes

    You’ve seen the speedups of 100s to 1000s of times. We all know that the lion’s share of those techniques would also work on modern multi-core CPUs, leaving the GPU only the last 2x to 8x. When it’s 8x, the GPU is the obvious choice. When it’s 2x, would the better choice be a bigger CPU or a bigger GPU?

    Now that AMD has launched their 32-core CPU, the answer to that question changes. Not only because of the 32 cores, but also because of the 256-bit vector computations via AVX2. This means that 32 double4’s can be worked on each clock cycle. A 16-core CPU with 128-bit vector units could only work on 16 double2’s, a quarter of that performance.

    Intel reacted immediately by hinting that they will also launch a 32-core Xeon. Meanwhile IBM is working on launching their quad-threaded 24-core POWER9 CPU. Cavium is providing 64-bit 64-core ARM processors, which also need many threads to keep them busy. Not only are core counts increasing, but the interconnect standards are all pushing for upgrades, while HBM seeks a way outside GPUs.

    CPUs are reborn.

    We will discuss the advantages of these CPUs in upcoming blog posts. Continue reading “CPU Code modernisation – our hidden expertise”

    New training dates for OpenCL on CPUs and GPUs!

    Reading Time: 1 minute

    OpenCL remains a popular programming language for accelerators, from embedded to HPC. Good examples are consumer software and embedded devices. With Vulkan potentially getting OpenCL support in the future, the number of supported devices will only increase.

    For multicore-CPUs and GPUs we now have monthly training dates for the rest of the year:

    The minimum number of participants is two. On request, the location and date can be changed.

    The first day of the training is the OpenCL Foundations training, which can be booked separately.

    For more information call us at +31854865760.

    IWOCL 2017 slides and proceedings now available

    Reading Time: 1 minute

    A month ago IWOCL (the OpenCL workshop) and DHPCC++ (C++ for GPUs) took place. Meanwhile many slides and posters have been published online. As of today, 23 talks have slides available.

    The proceedings are available via the ACM Digital Library. This requires an ACM Digital Library subscription of $198, if your company/university does not have access yet.

    IWOCL 2018 will be in Edinburgh (Scotland, UK), 15-17 May 2018 (provisional).

    Bug fixing the MESA 3D drivers

    Reading Time: 3 minutes

    Most of our projects are about performance optimisation, but we clean up bugs too, because you can only speed up software once certain types of bugs are cleared out. A few months ago, we got a different type of request: whether we could solve bugs in MESA 3D that appear in games.

    Yes, we wanted to try that and got a list of bugs to solve. And as you can read, we were successful.

    Below you’ll find a detailed description of one of the 5 bugs we solved by digging deep into the different games and the MESA 3D drivers. At the end of the blog post you’ll find the full list, with links to the issues in MESA’s bugtracker. Continue reading “Bug fixing the MESA 3D drivers”

    Slow Software Hotline

    Reading Time: 1 minute

    In a perfect world all software is fast, giving us time to do actual work. Unfortunately we live in an imperfect world, and we have to spend extra time controlling our anger as the software keeps us waiting.

    Therefore we have opened the Slow Software Hotline – reachable via both phone and email. It has one goal: make you feel happy again.

    Reporting is easy. Just name the commercial software that needs to be sped up and why. We’ll do the rest. If you need help with initial anger management due to the slow (and/or buggy) software, we’re happy to help with breathing practices.

    Phone: +31 854865760


    We will not sit still until all software is fast. Speeding up all software out there, one at a time.

    Why did AMD open source ROCm’s OpenCL driver-stack?

    Reading Time: 4 minutes

    AMD open-sourced the OpenCL driver stack for ROCm at the beginning of May. With this they kept their promise to open source (almost) everything. The hcc compiler was open-sourced earlier, as were the kernel driver and several other parts.

    Why is this a big thing?
    There are indeed several open source OpenCL implementations, but with one big difference: they’re secondary to the official compiler/driver. So implementations like PortableCL and Intel Beignet play catch-up. AMD’s open source implementation is primary.

    It contains:

    • OpenCL 1.2 compatible language runtime and compiler
    • OpenCL 2.0 compatible kernel language support with OpenCL 1.2 compatible runtime
    • Support for offline compilation right now – in-process/in-memory JIT compilation is to be added.

    For testing the implementation, see the Khronos OpenCL CTS framework or Phoronix.

    Why is it open sourced?

    There are several reasons. AMD wants to stand out in HPC and therefore listened carefully to their customers, while taking good note of where HPC was going. Where open source used to be something not for businesses, it is now simply required to be commercially successful. Below are the most important answers to this question.

    Give deeper understanding of how functions are implemented

    It is very useful to understand how functions are implemented. For instance, the difference between sin() and native_sin() can tell you a lot more about what’s best to use. It does not tell you how the functions are implemented on the GPU, but it does tell you which GPU functions are called.

    Learning a new platform has never been so easy. Deep understanding is needed if you want to go beyond “it works”.

    Debug software

    When you work on a large project and have to use proprietary libraries, this is a typical delay factor. I think every software engineer has had the experience that a library does not perform as documented and work-arounds had to be created. Depending on the project and the library, this could cause weeks of delay – only sarcasm can describe those situations, as the legal documents were often a lot better than the software documentation. When the library was open source, the debugger could step in and give the “aha” that was needed to progress.

    When working with drivers it’s about the same. GPU drivers and compilers are extremely complex, and of course your project hits that bug which nobody encountered before. Now that all is open source, you can step into the driver with the debugger. Moreover, the driver can be compiled with a fix instead of a work-around.

    Get bugs solved quicker

    A trace can now include the driver stack and the line numbers. Even a suggestion for a fix can be given. This not only improves reproducibility, but reduces the time needed to get the fix through all the steps. When a fix is suggested, AMD only needs to test for regressions to accept it. This makes the work for tools like CLsmith a lot easier.

    Have “unimportant” specific improvements done

    Say your software is important and in the spotlight, like Blender or the LuxMark benchmark; then you can expect it to get attention in optimisations. The rest of us have to hope that our special code constructions resemble one that is targeted. This results in many forum comments and bug reports being written, for which the compiler team does not have enough time. That is frustrating for both sides.

    Now everybody can have their improvements submitted, provided they do not slow down the focus software, of course.

    Get the feature set extended

    Adding SPIR-V is easy now. The SPIR-V frontend needs to be added to ROCm and the right functions need to be added to the OpenCL driver. Unfortunately there is no support for OpenCL 2.x host code yet – due to a lack of demand, I understood.

    For such extensions the AMD team needs to be consulted first, because this has implications on the test-suite.

    Get support for complete new things

    It takes a single person to make something completely new – and this becomes a whole lot easier now.

    More often there is opportunity in what is not there yet, and research needs to be done to break the chicken-and-egg problem. Optimised 128-bit computing? Easy complex numbers in OpenCL? Native support for Halide as an alternative to OpenCL? All the high-performance code is there for you.

    Initiate alternative implementations (?)

    Not a goal, but forks are coming for sure. For most forks the goals would be like the ones above, to be merged later with the master branch. A few forks will go their own direction – for now it’s hard to predict where those will go.

    Improve and increase university collaborations

    When the software was protected, working on AMD’s compiler infrastructure was only possible under strict contracts. In the end it was easier to focus on the open source backends of LLVM than to go down the legal path.

    Universities are very important for finding unexpected opportunities, integrating the latest research, bringing in potential new employees and doing research collaborations. An added bonus for the students is that the GPUs might be allowed to be used for games too.

    Timour Paltashev (Senior manager, Radeon Technology Group, GPU architecture and global academic connections) can be reached via timour dot paltashev at amd dot com for more info.

    Get better support in more Linux distributions

    It’s easier to include open source drivers in Linux distributions. These OpenCL drivers do need a binary firmware (which has been disassembled and seems to do as advertised), but the discussion is whether this counts as part of the hardware or the software when marking the stack as “libre”.

    There are many obstacles to having the complete ROCm stack included as the default, but in its current state it stands a much better chance.


    Phoronix benchmarked ROCm 1.4 OpenCL in January on several systems, and now ROCm 1.5 OpenCL on a Radeon RX 470. Though the 1.5 benchmarks were more limited, the important conclusion is that the young compiler is now mostly on par with the closed-source OpenCL implementation combined with the AMDGPU drivers – only LuxMark AMDGPU was (much) better. The same goes for the comparison with the old proprietary fglrx drivers, which were fully optimised and the first goal to catch up with. You’ll see another big step forward with ROCm 1.6 OpenCL.

    Get started

    You can find the build instructions here. Let us know in the comments what you’re going to do with it!

    Khronos Releases OpenCL 2.2 With SPIR-V 1.2

    Reading Time: 4 minutes

    Today Khronos has released OpenCL 2.2 with SPIR-V 1.2.

    The most important changes are:

    • A static subset of the C++14 standard as a kernel language. The OpenCL C++ kernel language includes classes, templates, lambda expressions, function overloads and many other constructs to increase parallel programming productivity through generic and meta-programming.
    • Access to the C++ language from OpenCL library functions to provide increased safety and reduced undefined behavior while accessing features such as atomics, iterators, images, samplers, pipes, and device queue built-in types and address spaces.
    • Pipe storage: compile-time pipes. This is a device-side type in OpenCL 2.2 that is useful for FPGA implementations, enabling efficient device-scope communication between kernels.
    • Enhanced optimization of generated SPIR-V code. Applications can provide the value of specialization constants at SPIR-V compilation time, a new query can detect non-trivial constructors and destructors of program scope global objects, and user callbacks can be set at program release time.
    • The KhronosGroup/OpenCL-Headers repository has been flattened. From now on, all versions of the OpenCL headers will be available not in separate branches, but in the master branch in separate directories named opencl10, opencl11, etc. Old branches are not removed, but they may not be updated in the future.
    • The OpenCL specifications are now open source. The OpenCL Working Group decided to publish the sources of recent OpenCL specifications on GitHub, including the just-released OpenCL 2.2 and OpenCL C++ specifications. If you find a mistake, you can create a merge request fixing the problem.

    This is what we said about the release:

    “We are very excited and happy to see OpenCL C++ kernel language being a part of the OpenCL standard,” said Vincent Hindriksen, founder and managing director of StreamHPC. “It’s a great achievement, and it shows that OpenCL keeps progressing. After developing conformance tests for OpenCL 2.2 and helping finalizing OpenCL C++ specification, we are looking forward to work on first projects with OpenCL 2.2 and the new kernel language. My team believes that using OpenCL C++ instead of OpenCL C will result in improved software quality, reduced maintenance effort and faster time to market. We expect SPIR-V to heavily impact the compiler ecosystem and bring several new OpenCL kernel languages.”

    Continue reading “Khronos Releases OpenCL 2.2 With SPIR-V 1.2”

    Caffe and Torch7 ported to AMD GPUs, MXnet WIP

    Reading Time: 4 minutes

    Last week AMD released ports of Caffe, Torch and (work-in-progress) MXnet, so these frameworks now work on AMD GPUs. With the Radeon MI6, MI8 and MI25 (25 TFLOPS half precision) to be released soon, it is of course simply necessary to have software run on these high-end GPUs.

    The ports were announced in December, where one slide showed the MI25 to be about 1.45x faster than the Titan XP. With the release of the three frameworks, current GPUs can now be benchmarked and compared.

    Especially the expected good performance/price ratio will make this very interesting for large installations. Another slide discussed which frameworks will be ported: Caffe, TensorFlow, Torch7, MXNet, CNTK, Chainer and Theano.

    This leaves HIP ports of TensorFlow, CNTK, Chainer and Theano still to be released. Continue reading “Caffe and Torch7 ported to AMD GPUs, MXnet WIP”

    Rebranding the company name from StreamComputing to StreamHPC

    Reading Time: 2 minutes

    The name StreamComputing has been in use since 2010 and is now widely known in the GPU-computing industry. But the name has three problems: we don’t own the .com domain, it does not directly show what we do, and it is quite long.

    Some background on the old domain name

    While our initial focus was Europe, for years now 95% of our projects have been done for customers outside the Netherlands, and over 50% outside Europe – with the .eu domain we don’t show our current international focus.

    But that’s not all. The name sticks well in academia, where people are used to longer names – just try to read a book on chemistry. The alternative names I tested were not well received, for various reasons. Just as “fast” is associated with fast food, “computing” is not directly associated with HPC, so “fast computing” simply gets weird. Since several customers already referred to us as “Stream”, it made sense to keep that part of the name.

    Not a new beginning, but a more focused continuation

    Stream HPC better defines what we are: we build HPC software. The new name combines the well-known part of the old one with our specialization:

    • Programming GPUs with CUDA or OpenCL
    • Scaling code to multiple CPUs and GPUs
    • Creating AI-based software
    • Speeding up code and optimizing for given architectures
    • Code improvement
    • Compiler testing and compiler building (LLVM)

    The HPC focus also enabled us to improve ourselves. We have put a lot of time into professionalizing our development environment and project management, implementing suggestions from current customers and our friends. We were already used to working fully independently and being self-managed, but now we were able to standardize more of it.

    The rebranding process

    Rebranding will take some time, as our logo and name appear in many places. For the legal part we will take some more time, as we don’t want to run into problems with, for example, all the NDAs. Email will keep working on both domains.

    We will contact all organizations we’re a member of over the coming weeks. If you read this and we haven’t reached you yet, you can also contact us.

    StreamComputing will never really go away. The name was with us for 7 years and stands for growing along with the rise of GPU computing.

    What is Khronos as of today?

    Reading Time: 2 minutes

    The Khronos Group is the organization behind APIs like OpenGL, Vulkan and OpenCL. Over one hundred companies are members, and together they decide what your phone, camera, computer or media device will be capable of next year.

    We’re at the right, near the bottom of the members overview.

    We work mostly with OpenCL, but as you probably noticed, we work with OpenGL, Vulkan and SPIR too. Currently Khronos maintains the following APIs:

    • COLLADA, a file-format intended to facilitate interchange of 3D assets
    • EGL, an interface between Khronos rendering APIs such as OpenGL ES or OpenVG and the underlying native platform window system
    • glTF, a file format specification for 3D scenes and models
    • OpenCL, a cross-platform compute API
    • OpenGL, a cross-platform computer graphics API
    • OpenGL ES, a derivative of OpenGL for use on mobile and embedded systems, such as cell phones, portable gaming devices, and more
    • OpenGL SC, a safety critical profile of OpenGL ES designed to meet the needs of the safety-critical market
    • OpenKCam, an advanced camera control API
    • OpenKODE, an API for providing abstracted, portable access to operating system resources such as file systems, networks and math libraries
    • OpenMAX, a layered set of three programming interfaces of various abstraction levels, providing access to multimedia functionality
    • OpenML, an API for capturing, transporting, processing, displaying, and synchronizing digital media
    • OpenSL ES, an audio API tuned for embedded systems, standardizing access to features such as 3D positional audio and MIDI playback
    • OpenVG, an API for accelerating processing of 2D vector graphics
    • OpenVX, a hardware acceleration API for computer vision applications and libraries
    • OpenWF, APIs for 2D graphics composition and display control
    • OpenXR, an open and royalty-free standard for virtual reality and augmented reality applications and devices
    • SPIR, an intermediate compiler target for OpenCL and Vulkan
    • StreamInput, an API for consistently handling input devices
    • Vulkan, a low-overhead computer graphics API
    • WebCL, a JavaScript binding to OpenCL within a browser
    • WebGL, a JavaScript binding to OpenGL ES within a browser on any platform supporting the OpenGL or OpenGL ES graphics standards

    Too few people realize how unique the organization is: the biggest processor vendors – normally the fiercest competitors – sit at one table to discuss collaboration and how to move the market. Without Khronos it would have been a totally different world.

    AMD ROCm 1.5 Linux driver-stack is out

    Reading Time: 4 minutes

    ROCm is AMD’s open-source Linux driver stack that brings compute to HSA hardware. It does not provide graphics and therefore focuses on monitor-less applications like machine learning, math, media processing, machine vision, large-scale simulations and more.

    For those who do not know HSA: the Heterogeneous System Architecture defines hardware and software such that different processor types (like CPU, GPU, DSP and FPGA) can seamlessly work together and share memory at a fine-grained level. Read more on HSA here.

    About ROCm and its short history

    The driver stack has been on GitHub for more than a year now. Development is done internally, while communication with users mostly happens via GitHub’s issue tracker. ROCm 1.0 was publicly announced on 25 April 2016. Since version 1.0 there have been six releases in only one year – the four months between 1.4 and 1.5 were therefore a relatively long wait. You can certainly say development moves at a high pace.

    ROCm 1.4 was released at the end of December and, besides a long list of bug fixes, added a developer preview of OpenCL 2.0 kernel support. OpenCL support was limited to Fiji (R9 Fury series) and Baffin/Ellesmere (Radeon RX 400 series) GPUs, as these have the best HSA support of the current GPU offerings.

    Currently not all parts of the driver stack are open source, but the remaining binary blobs will be open sourced eventually. You might wonder why a big corporation like AMD would open source such an important part of their offering. It makes total sense once you understand that their most important customers spend a lot of time making the drivers and their own code work together. Giving access to the code makes debugging a lot easier and reduces development time. The result is fewer bugs and a shorter time-to-market for the AMD version of the software.

    The OpenCL language runtime and compiler will be open sourced soon, after which AMD can offer full OpenCL without any binary blob.

    What does ROCm 1.5 bring?

    Version 1.5 adds improved support for OpenCL, where 1.4 only gave a developer preview. Both feature support and performance have been improved. Just like in 1.4 there is support for OpenCL 2.0 kernels and OpenCL 1.2 host code – the tool clinfo even mentions some support for 2.1 kernels, but we haven’t fully tested this yet.
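    To illustrate what OpenCL 2.0 kernel support adds over 1.2-era kernels, below is a minimal device-code sketch of our own (not from the ROCm samples) using the OpenCL 2.0 built-in work_group_reduce_add; it needs to be built with -cl-std=CL2.0:

    ```c
    // OpenCL C 2.0 device code: per-work-group sums via the 2.0 built-in
    // work_group_reduce_add, which has no OpenCL C 1.x equivalent.
    __kernel void partial_sums(__global const float *in,
                               __global float *out)
    {
        float v = in[get_global_id(0)];
        float sum = work_group_reduce_add(v); // OpenCL 2.0 built-in
        if (get_local_id(0) == 0)
            out[get_group_id(0)] = sum;
    }
    ```

    In OpenCL C 1.x the same reduction would require a hand-written local-memory loop with barriers.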

    The command-line administration tool (ROCm-SMI) adds power monitoring, so power efficiency can be measured.
    The HCC compiler was upgraded to the latest Clang/LLVM. There have also been big improvements in C++ compatibility.

    Other improvements:

    1. Added a new API, hipHccModuleLaunchKernel, which works exactly like hipModuleLaunchKernel but takes launch parameters in the style of the OpenCL programming model (a test is included)
    2. Added a new API, hipMemPtrGetInfo
    3. Added a new field, gcnArch, to hipDeviceProp_t, which returns 803, 700, 900, etc.
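    As an aside, the gcnArch value from point 3 encodes the GPU generation in its leading digit. A hypothetical helper – not part of the HIP API – to turn it into a GFX family name could look like this:

    ```c
    #include <assert.h>
    #include <string.h>

    /* Hypothetical helper, not part of HIP: map the gcnArch value
     * reported in hipDeviceProp_t to its GFX generation name. */
    static const char *gcn_arch_name(int gcnArch)
    {
        switch (gcnArch / 100) {
        case 7: return "GFX7"; /* e.g. 700: Hawaii */
        case 8: return "GFX8"; /* e.g. 803: Fiji, Polaris */
        case 9: return "GFX9"; /* e.g. 900: Vega */
        default: return "unknown";
        }
    }

    int main(void)
    {
        assert(strcmp(gcn_arch_name(803), "GFX8") == 0);
        assert(strcmp(gcn_arch_name(700), "GFX7") == 0);
        assert(strcmp(gcn_arch_name(900), "GFX9") == 0);
        return 0;
    }
    ```
    
    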

    Bug fixes:

    1. Fixed copyright and header names
    2. Fixed an issue with the bit_extract sample
    3. Enabled lgamma and lgammaf
    4. Added a guard for GFX8-specific intrinsics
    5. Fixed a few issues with operator overloading of vector data types
    6. Fixed atanf
    7. Added a guard for __half data types to work with Clang versions above 3 (will be removed eventually)
    8. Fixed 4_shfl to work only on gfx803, as Hawaii doesn’t support the permute ops

    Current hardware support:

    • GFX7: Radeon R9 290 4 GB, Radeon R9 290X 8 GB, Radeon R9 390 8 GB, Radeon R9 390X 8 GB, FirePro W9100 (16 GB), FirePro S9150 (16 GB), and FirePro S9170 (32 GB).
    • GFX8: Radeon RX 480, Radeon RX 470, Radeon RX 460, Radeon R9 Nano, Radeon R9 Fury, Radeon R9 Fury X, Radeon Pro WX7100, Radeon Pro WX5100, Radeon Pro WX4100, and FirePro S9300 x2.

    If you’re buying new hardware, pick a GPU from the GFX8 list. The FirePro S9300 x2 is currently the server-grade solution of choice.

    Keep an eye on the Phoronix website, which is usually the first to benchmark AMD’s open source drivers.

    Install ROCm 1.5

    Where 1.4 supported Ubuntu 14.04, Ubuntu 16.04 and Fedora 23, 1.5 adds support for Fedora 24 and drops Ubuntu 14.04 and Fedora 23. On distributions other than Ubuntu 16.04 and Fedora 24 it *could* work, but there are zero guarantees.

    Follow the instructions on GitHub step-by-step to get it installed via deb or rpm. Be sure to uninstall any previous release of ROCm to avoid problems.

    The part on GRUB might not be clear. For this release, the magic GRUB_DEFAULT line on Ubuntu 16.04 is:

    GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 4.9.0-kfd-compute-rocm-rel-1.5-76"

    You need to update this line with every release, otherwise GRUB will keep booting the old kernel.
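    Since this line changes with every release, the edit can be scripted. A sketch below (assuming Ubuntu 16.04) operates on a copy of the file so nothing is touched by accident; to apply it for real, run the same sed on /etc/default/grub followed by sudo update-grub:

    ```shell
    # Sketch: set GRUB_DEFAULT in a copy of the GRUB config.
    # The kernel string is specific to ROCm 1.5 on Ubuntu 16.04.
    ENTRY='Advanced options for Ubuntu>Ubuntu, with Linux 4.9.0-kfd-compute-rocm-rel-1.5-76'
    cp /etc/default/grub grub.tmp 2>/dev/null || printf 'GRUB_DEFAULT=0\n' > grub.tmp
    sed -i "s|^GRUB_DEFAULT=.*|GRUB_DEFAULT=\"$ENTRY\"|" grub.tmp
    grep '^GRUB_DEFAULT=' grub.tmp
    # For the real file: sudo sed -i ... /etc/default/grub && sudo update-grub
    ```
    
    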

    Make sure “/opt/rocm/bin/” is in your PATH when you want to do some coding. When running the sample, you should get:

    /opt/rocm/hsa/sample$ sudo make
    gcc -c -I/opt/rocm/include -o vector_copy.o vector_copy.c -std=c99
    gcc -Wl,--unresolved-symbols=ignore-in-shared-libs vector_copy.o -L/opt/rocm/lib -lhsa-runtime64 -o vector_copy
    /opt/rocm/hsa/sample$ ./vector_copy
    Initializing the hsa runtime succeeded.
    Checking finalizer 1.0 extension support succeeded.
    Generating function table for finalizer succeeded.
    Getting a gpu agent succeeded.
    Querying the agent name succeeded.
    The agent name is gfx803.
    Querying the agent maximum queue size succeeded.
    The maximum queue size is 131072.
    Creating the queue succeeded.
    "Obtaining machine model" succeeded.
    "Getting agent profile" succeeded.
    Create the program succeeded.
    Adding the brig module to the program succeeded.
    Query the agents isa succeeded.
    Finalizing the program succeeded.
    Destroying the program succeeded.
    Create the executable succeeded.
    Loading the code object succeeded.
    Freeze the executable succeeded.
    Extract the symbol from the executable succeeded.
    Extracting the symbol from the executable succeeded.
    Extracting the kernarg segment size from the executable succeeded.
    Extracting the group segment size from the executable succeeded.
    Extracting the private segment from the executable succeeded.
    Creating a HSA signal succeeded.
    Finding a fine grained memory region succeeded.
    Allocating argument memory for input parameter succeeded.
    Allocating argument memory for output parameter succeeded.
    Finding a kernarg memory region succeeded.
    Allocating kernel argument memory buffer succeeded.
    Dispatching the kernel succeeded.
    Passed validation.
    Freeing kernel argument memory buffer succeeded.
    Destroying the signal succeeded.
    Destroying the executable succeeded.
    Destroying the code object succeeded.
    Destroying the queue succeeded.
    Freeing in argument memory buffer succeeded.
    Freeing out argument memory buffer succeeded.
    Shutting down the runtime succeeded.

    clinfo (installed from the default repo) should also work.

    Got it installed and tried your code? Did you see improvements? Share your experiences in the comments!

    Not really ROCk music, but this blog post was written while listening to the latest Gorillaz album.

    DHPCC++ Program known

    Reading Time: 1 minute

    During IWOCL, a workshop takes place that discusses the opportunities C++ brings to OpenCL-enabled processors. A well-known example is SYCL, but various other approaches are discussed as well.

    The Distributed & Heterogeneous Programming in C/C++ Workshop just released their program:

    • HiHAT: A New Way Forward for Hierarchical Heterogeneous Asynchronous Tasking.
    • SYCL C++17 and OpenCL interoperability experimentation with triSYCL.
    • KART – A Runtime Compilation Library for Improving HPC Application Performance.
    • Using SYCL as an Implementation Framework for HPX Compute.
    • Adding OpenCL to Eigen with SYCL.
    • SYCL-BLAS: Leveraging Expression Trees for Linear Algebra.
    • Supporting Distributed and Heterogeneous Programming Models in ISO C++.
    • Towards an Asynchronous Data Flow Model for SYCL 2.2.
    • Communicating Execution Contexts using Channels.
    • Towards a Unified Interface for Data Movement.
    • Panel: What do you want in C++ for Heterogeneous.

    As C++ is an important direction for OpenCL, we expect most of the discussion on the programmability of OpenCL-enabled processors to take place here.

    The detailed program will be released soon. For more information, see the DHPCC++ webpage. At the IWOCL website you can buy tickets and passes – combining with IWOCL gives a discount.