How to get full CMake support for AMD HIP SDK on Windows – including patches

Written by Máté Ferenc Nagy-Egri and Gergely Mészáros

Disclaimer: if you’ve stumbled across this page in search of fixing up the ROCm SDK’s CMake HIP language support on Windows and care only about the fix, please skip to the end of this post to download the patches. If you wish to learn some things about ROCm and CMake, join us for a ride.

Finally, ROCm on Windows

The recent release of AMD's ROCm SDK on Windows brings a long-awaited rejuvenation of developer tooling for offload APIs. Undoubtedly its most anticipated feature is a HIP-capable compiler. The runtime component amdhip64.dll has been shipping with AMD Software: Adrenalin Edition for multiple years now, and with some trickery one could consume the HIP host-side API by taking the API headers from GitHub (or a Linux ROCm install) and creating an export lib from the driver DLL. Compiling device code offline and feeding it to HIP's Module API was attainable, yet cumbersome. The anticipation is driven by the single-source compilation model HIP borrowed from CUDA. That is finally available* now!

[*]: That is, if you are using Visual Studio and MSBuild, or legacy HIP compilation atop CMake CXX language support.
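For context, the difference between the two routes shows up directly in CMake. Below is a minimal sketch (project and file names are made up for illustration; native HIP language support requires CMake 3.21 or newer, and on Windows it is exactly the part that needs the patches discussed in this post):

```cmake
cmake_minimum_required(VERSION 3.21)

# Route 1: first-class HIP language support
project(saxpy LANGUAGES HIP)
add_executable(saxpy saxpy.hip)

# Route 2 (legacy): compile HIP sources as ordinary C++ and link the
# runtime via the hip CMake package shipped with the SDK:
#
#   project(saxpy LANGUAGES CXX)
#   find_package(hip REQUIRED)
#   add_executable(saxpy saxpy.cpp)
#   target_link_libraries(saxpy PRIVATE hip::device)
```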

Continue reading “How to get full CMake support for AMD HIP SDK on Windows – including patches”

N-Queens project from over 10 years ago

Why you should dive into porting difficult puzzles to the GPU, to learn GPGPU languages like CUDA, HIP, SYCL, Metal or OpenCL. And if you have not picked a puzzle yet, why not N-Queens? N-Queens is a truly fun puzzle to work on, and I am looking forward to learning about better approaches via the comments.

We love it when junior applicants have a personal project to show, even if it's unfinished. As it can be scary to share such an unfinished project, I'll go first.

Introduction in 2023

Everybody who starts in GPGPU has this moment where they feel great about the progress and speedup, but then suddenly get totally overwhelmed by the endless paths to more optimizations. And of course 90% of the potential optimizations don't work well – it takes many years of experience (and mentors in the team) to excel at it. This was also a main reason why I like GPGPU so much: it remains difficult for a long time, and it never bores. My personal project where I had this overwhelmed+underwhelmed feeling was N-Queens – until then I could solve the problems in front of me.

I worked on this backtracking problem as a personal fun-project in the early days of the company (2011?), and decided to blog about it in 2016. But before publishing I thought the story was not ready to be shared, as I changed the way I coded, learned so many more optimization techniques, and (like many programmers) thought the code needed a full rewrite. Meanwhile I had to focus much more on building the company, and also my colleagues got better at GPGPU-coding than me – this didn’t change in the years after, and I’m the dumbest coder in the room now.

Today I decided to just share what I wrote down in 2011 and 2016, and for now focus on fixing the text and links. As the code was written in Aparapi and not pure OpenCL, it would take some good effort to make it available – I decided not to do that, to prevent postponing it even further. Luckily somebody on this same planet had about the same approaches as I had (plus more), and actually finished the implementation – scroll down to the end, if you don’t care about approaches and just want the code.

Note that when I worked on the problem, I used an AMD Radeon GPU and OpenCL. Tools from AMD were hardly there yet, so you might find a remark that has not aged well.

Introduction in 2016

What do 1, 0, 0, 2, 10, 4, 40, 92, 352, 724, 2680, 14200, 73712, 365596, 2279184, 14772512, 95815104, 666090624, 4968057848, 39029188884, 314666222712, 2691008701644, 24233937684440, 227514171973736, 2207893435808352 and 22317699616364044 have to do with each other? They are the numbers of solutions of the N-Queens problem for board sizes 1 through 26. Even if you are not interested in GPGPU (OpenCL, CUDA), this article should give you some background on this interesting puzzle.
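To get a feel for the puzzle before any GPU is involved, here is a minimal sequential backtracking counter in Python – a sketch for illustration, not the Aparapi/OpenCL code this article is about. It uses the classic bitmask trick, tracking attacked columns and both diagonals as bits:

```python
def count_nqueens(n):
    # cols, d1, d2 are bitmasks of squares attacked in the current row
    # by earlier queens: columns, "/" diagonals and "\" diagonals.
    def solve(row, cols, d1, d2):
        if row == n:
            return 1
        total = 0
        free = ~(cols | d1 | d2) & ((1 << n) - 1)  # placeable squares
        while free:
            bit = free & -free        # lowest free square
            free -= bit
            # shift the diagonal masks as we move down one row
            total += solve(row + 1, cols | bit,
                           (d1 | bit) << 1, (d2 | bit) >> 1)
        return total
    return solve(0, 0, 0, 0)

print([count_nqueens(n) for n in range(1, 9)])  # [1, 0, 0, 2, 10, 4, 40, 92]
```

The counts grow explosively – N=17, mentioned below, already has 95,815,104 solutions – which is exactly what makes the puzzle an interesting GPU target.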

An existing N-Queens implementation in OpenCL took 2.89 seconds for N=17 on my GPU, while Nvidia hardware took half that time. I knew it did not use the full potential of the GPU, because bitcoin-mining dropped to 55% and not to 0%. 🙂 I only had to find those optimizations by redoing the port from another angle.

This article was written while I programmed (as a journal), so you see which questions I asked myself to get to the solution. I hope this also gives some insight into how I work, and into the hard part of the job: most of the energy goes into fruitless preparations.

Continue reading “N-Queens project from over 10 years ago”

NVIDIA ended their support for OpenCL in 2012

If you are looking for the samples in one zip-file, scroll down. The removed OpenCL-PDFs are also available for download.

The sentence “NVIDIA’s Industry-Leading Support For OpenCL” was proudly displayed on NVIDIA’s OpenCL page last year. It seems that NVIDIA saw a great future for OpenCL on their GPUs. But when CUDA began borrowing the idea of using LLVM for compiling kernels, NVIDIA’s support for OpenCL slowly started to fade instead. Since LLVM makes it possible to load CUDA kernels in OpenCL and vice versa, it could have brought the two techniques closer together.

What is the cause of this decreased support for OpenCL? Did they suddenly become aware that LLVM would erode any advantage of CUDA over OpenCL, and therefore scaled back OpenCL support? Or did they decide so long ago, given that their last OpenCL-conformant product on Windows is from July 2010? We cannot be sure, but we do know NVIDIA has no official statement on the matter.

The latest action demonstrating NVIDIA’s reduced support of OpenCL is the absence of the samples from their GPGPU-SDK. NVIDIA removed them without notice or a clear statement on their position on OpenCL. Therefore we decided to start a petition to get these OpenCL samples back. The only official statement on the removal of the samples was on LinkedIn:

All of our OpenCL code samples are available at http://developer.nvidia.com/opencl, and the latest versions all work on the new Kepler GPUs.
They are released as a separate download because developers using OpenCL don’t need the rest of the CUDA Toolkit, which is getting to be quite large.
Sorry if this caused any alarm, we’re just trying to make life a little easier for OpenCL developers.

Best regards,

Will.

William Ramey
Sr. Product Manager, GPU Computing
NVIDIA Corporation

Continue reading “NVIDIA ended their support for OpenCL in 2012”

Privacy Policy

Who we are

We are a group of companies based in the Netherlands, Hungary and Spain. We help our customers make their code run fast by optimizing the computations and using accelerators. We have been doing this since 2010.

Comments

When visitors leave comments on the site we collect the data shown in the comments form, and also the visitor’s IP address and browser user agent string to help spam detection.

An anonymised string created from your email address (also called a hash) may be provided to the Gravatar service to see if you are using it. The Gravatar service Privacy Policy is available here: https://automattic.com/privacy/. After approval of your comment, your profile picture is visible to the public in the context of your comment.

Forms

Form data is sent to self-hosted software and is not read by any third party.

Tracking

We use anonymized tracking to find out:

  • Which pages are visited how often
  • Which subjects are popular
  • Which pages are clicked through
  • From which countries or states the visitors are

During a visit/session, you get a random ID.

Cookies

If you leave a comment on our site you may opt in to saving your name, email address and website in cookies. These are for your convenience so that you do not have to fill in your details again when you leave another comment. These cookies will last for one year.

Tracking cookies last for 24 hours.

Embedded content from other websites

Articles on this site may include embedded content (e.g. videos, images, articles, etc.). Embedded content from other websites behaves in the exact same way as if the visitor had visited the other website.

These websites may collect data about you, use cookies, embed additional third-party tracking, and monitor your interaction with that embedded content, including tracking your interaction with the embedded content if you have an account and are logged in to that website.

Who we share your data with

None of the data is shared with any third party. Marketing reports don’t contain any personal data.

How long we retain your data

If you leave a comment, the comment and its metadata are retained indefinitely. This is so we can recognize and approve any follow-up comments automatically instead of holding them in a moderation queue.

Anonymous tracking data is not thrown away, so we can find trends over the years.

What rights you have over your data

If you have left comments, you can request to receive an exported file of the personal data we hold about you, including any data you have provided to us. You can also request that we erase any personal data we hold about you. This does not include any data we are obliged to keep for administrative, legal, or security purposes.

Where your data is sent

Visitor comments and form submissions are checked through automated spam-detection services: ReCAPTCHA and Akismet.

Reporting problems

We are not in the business of monetizing user data, and believe in finding new customers through content.

As software and plugins change after updates, we are sometimes surprised that more is collected than we configured.

If anything is incorrect or not legal, please email to privacy@streamhpc.com. If you have generic questions, go to the contact page or email to info@streamhpc.com.

The 12 latest Twitter Poll Results of 2018

Via our Twitter channel we run various polls. We have not always shared the full background of these polls, so we’ve taken the polls of the past half year and put them here. In case you wondered: there were no polls in the first half of the year.

As all-inclusive polls are not focused (and thus difficult to answer), most polls are incomplete by design. Still, they offer insights – and room for comments.

The polls below have given us insight, and we hope they give you insight too into how our industry is developing. They are sorted by date, oldest first.

It was very interesting that the percentage of votes per choice did not change much after 30 votes. Even when it was retweeted by a large account, opinions had the same distribution.

Is HIP (a clone of CUDA) an option?

Continue reading “The 12 latest Twitter Poll Results of 2018”

OpenCL Videos of AMD’s AFDS 2012

AFDS was full of talks on OpenCL. Did you miss them, just like me? Then you will be happy that they put many videos on YouTube!

Enjoy watching! As all videos are around 40 minutes, it is best to set aside a full day to watch them all. The first part is on OpenCL itself, the second on tools, the third on OpenCL use-cases, and the fourth on other subjects.

Continue reading “OpenCL Videos of AMD’s AFDS 2012”

The 13 application areas where OpenCL and CUDA can be used

Did you find your specialism in the list? The formula is the easiest introduction to GPGPU I could think of, including the need of auto-tuning.

Which algorithms map best to which accelerator? In other words: what kinds of algorithms are faster when using accelerators and OpenCL/CUDA?

Professor Wu Feng and his group from Virginia Tech took a close look at which types of algorithms were a good fit for vector processors. This resulted in a document: “The 13 (computational) dwarves of OpenCL” (2011). It became an important document here at StreamHPC, as it gave a good starting point for investigating new problem spaces.

The document is inspired by Phil Colella, who identified seven numerical methods that are important for science and engineering. He named these algorithmic methods “dwarves”. With 6 more application areas in which GPUs and other vector-accelerated processors did well, the list was completed.

As a funny side-note, in Brothers Grimm’s “Snow White” there were 7 dwarves and in Tolkien’s “The Hobbit” there were 13.

Continue reading “The 13 application areas where OpenCL and CUDA can be used”

Big Data

Big data is a term for data so large or complex that traditional processing applications are inadequate. Challenges include:

  • capture, data-curation & data-management,
  • analysis, search & querying,
  • sharing, storage & transfer,
  • visualization, and
  • information privacy.

The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.

At StreamHPC we’re focused on optimizing (predictive) analytics and data-handling software, as these tend to be slow. We have solved Big Data problems in two areas: real-time pre-processing (filtering, structuring, etc.) and analytics (including in-memory search on a GPU).

OpenCL Fireworks

I like and appreciate the differences between the many cultures on our Earth, but I also like to recognise very old traditions everywhere, to feel a sort of ancient bond. As a European citizen I’m quite familiar with the replacement of the weekly flowers with a complete tree each December – and the burning of all those trees in January. The celebration of New Year also falls on different dates, the Chinese New Year being the best known (3 February 2011). And we – internet-using humans – all know the power of nicely coloured gunpowder: fireworks!

Let’s try to explain the workings of OpenCL in terms of fireworks. The following data is not realistic, but gives a good idea on how it works.

Continue reading “OpenCL Fireworks”

Contact us

Thank you for your interest in our company and services. We will try to answer your question within 24 hours.

There are three ways to get in contact:

    First Name (required)

    Last Name (required)

    Email (required)

    Company (required)

    Phone number

    Your Message

    See ‘about us‘ for the address and other business-specific information.

    5 types of loops you should avoid

    In “Separation of compute, control and transfer” I talked about node-wise programming as a method we should embrace instead of trying to unroll the existing loops. In this article I get into loops, discussing a few types and how they can be run in a parallel form. Dependency is the big variable in each type: the lower the dependency on previous iterations, the better the loop can be parallelised. Another variable is whether the iteration dimensions are known before the loop is started.

    The more you think about it, the more you find that a loop is not a loop.
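As a tiny illustration of that dependency spectrum (plain Python, just to make the idea concrete):

```python
xs = list(range(8))

# Independent iterations: each result depends only on its own index,
# so the loop body could run on many lanes at once.
squares = [x * x for x in xs]

# Dependent iterations: a running (prefix) sum; iteration i needs the
# result of iteration i-1, so this loop cannot be naively split across
# parallel workers without restructuring it.
prefix = []
total = 0
for x in xs:
    total += x
    prefix.append(total)

print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]
print(prefix)   # [0, 1, 3, 6, 10, 15, 21, 28]
```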

    Continue reading “5 types of loops you should avoid”

    Promotion for OpenCL Training (’12 Q4 – ’13 Q2)

    So you want your software to be much faster than the competition?

    In 4 days your software team learns all techniques to make extremely fast software.

    Your team will learn how to write optimal code for GPUs and make better use of the existing hardware. They will be able to write faster code immediately after the training – doubling the speed is the minimum, 100 times is possible. Your customers will notice the difference in speed.

    We use advanced, popular techniques like OpenCL and older techniques like cache-flow optimisation. At the end of the training you’ll receive a certificate from StreamHPC.

    Want more information? Contact us.

    About the training

    Location and Time

    OpenCL is a rather new subject, and hard-coding the location and time has not proven successful in past years for trainers in this subject. Therefore we chose flexible dates and initially offer the training in large/capital cities and technology centres world-wide.

    A final date for a city will be picked once there are 5 to 8 attendees, with a maximum of 12. You can specify your preferences for cities and dates in the form below.

    Some discounts are available for developing countries.

    Agenda

    Day 1: Introduction

    Learn about GPU architectures and AVX/SSE, how to program them and why it is faster.

    • Introduction to parallel programming and GPU-programming
    • An overview of parallel architectures
    • The OpenCL model: host-programming and kernel-programming
    • Comparison with NVIDIA’s CUDA and Intel’s Array Building Blocks.
    • Data-parallel and task-parallel programming
    Lab-session will be an image-filter.
    Note: since CUDA is very similar to OpenCL, you are free to choose to do the lab-sessions in CUDA.

    Day 2: Tools and advanced subjects

    Learn about parallel-programming tactics, host-programming (transferring data), IDEs and tools.

    • Static kernel analysis
    • Profiling
    • Debugging
    • Data handling and preparation
    • Theoretical backgrounds for faster code
    • Cache flow optimisation
    Lab-session: yesterday’s image-filters using a video-stream from a web-cam or file.

    Day 3: Optimisation of memory and group-sizes

    Learn the concept of “data-transport is expensive, computations are cheap”.
    • Register usage
    • Data-rearrangement
    • Local and private memory
    • Image/texture memory
    • Bank-conflicts
    • Coalescence
    • Prefetching
    Lab-session: various small puzzles, which can be solved using the explained techniques.

    Day 4: Optimisation of algorithms

    Learn techniques to help the compiler make better and faster code.
    • Precision tinkering
    • Vectorisation
    • Manual loop-unrolling
    • Unbranching
    Lab-session: like day 3, but now with compute-oriented problems.

    Enrolment

    When filling in this form, you declare that you intend to follow the course. Cancellation can be done via e-mail or phone at any time.

    StreamHPC will keep you up to date about the training at your location(s). When the minimum of 5 attendees has been reached, a final date will be discussed. If you selected multiple locations, you have the option to wait for a training in another city.

    Put any remarks you have in the message. If you have any question, mail to trainings@streamhpc.com.

    [si-contact-form form=’7′]

    Avoiding false dependencies in only two steps

    Let’s approach the concept of programming through looking at the brain, the code and the computer.

    The idea of a program lives in the brain of a programmer. The way to get the program to the computer is through a process called coding. When the program coded on the computer and the program embedded as an idea in the brain are alike, the programmer is happy. When, over time, the difference between the brain-version and the computer-version grows, we enter a maintenance phase (although this still works mostly from brain to computer).

    When the coding-language or important coding-paradigms change, something completely different happens. In that case the program in the brain has to be updated or altered. Humans are not good at that – or at least, not many textbooks discuss how to change from one model to another.

    In this article I want to discuss one of these new coding-paradigms: dependencies in parallel software.
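As a small taste, here is what a false dependency can look like (plain Python; the function names are made up for illustration):

```python
# A "false" dependency: the same temporary is reused across iterations,
# which serialises the loop even though the computations are independent.
def scaled_sums_serial(rows):
    out = []
    tmp = 0
    for row in rows:
        tmp = sum(row)       # tmp is overwritten every iteration
        out.append(tmp * 2)
    return out

# Giving each iteration its own private temporary removes the (false)
# dependency; each iteration now stands alone and could run in parallel.
def scaled_sums_parallel(rows):
    return [sum(row) * 2 for row in rows]

rows = [[1, 2], [3, 4]]
print(scaled_sums_parallel(rows))  # [6, 14]
```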
    Continue reading “Avoiding false dependencies in only two steps”

    AMD positions FirePro S10000 against both TESLA K10 (and K20)

    During the “little” HPC-show, SC12, several vendors have launched some very impressive products. Question is who steals the show from whom? Intel got their Phi-processor finally launched, NVIDIA came with the TESLA K20 plus K20X, and AMD introduced the FirePro S10000.

    This card is the fastest card out there, with 5.91 TFLOPS of processing power – much faster than the TESLA K20X, which only does 3.95 TFLOPS. But comparing a dual-GPU card to a single-GPU card is not always fair. The moment you choose to have more than one GPU (several GPUs in one case, or a small cluster), the S10000 can be fully compared to the Tesla K20 and K20X.

    The S10000 can be seen as a dual-GPU version of the S9000, but the specs do not simply double. Most obvious are the big difference in power usage (325 Watt) and the active cooling. As server cases are made for 225 Watt of cooling power, this is seen as a potential disadvantage. But AMD has clearly looked around: for GPUs, 1U cases are not used, but 3U servers that use the full width to stack several GPUs.

    Continue reading “AMD positions FirePro S10000 against both TESLA K10 (and K20)”

    StreamComputing exists 5 years!

    In January 2010 I took the first steps towards StreamComputing (redacted: rebranded to StreamHPC in 2017), by registering the website and writing a hello-world article. About 4 months of preparations and paperwork later, the freelance company was registered. Now, 5 years later, it has turned into a small company with still a strong focus on OpenCL, but with more employees and more customers.

    I would like to thank the following people:

    • My parents and grand-mother for (financially) supporting me, even though they did not always understand why I was taking all those risks.
    • My friends, for understanding I needed to work in the weekends and evenings.
    • My good friend Laura for supporting me during the hard times of 2011 and 2012.
    • My girlfriend Elena for always being there for me.
    • My colleagues and OpenCL-experts Anca, Teemu and Oscar, who have done the real work the past year.
    • My customers for believing in OpenCL and trusting StreamComputing.

    Without them, the company would never even have existed. Thank you! Continue reading “StreamComputing exists 5 years!”

    OpenCL mini buying guide for X86

    Developing with OpenCL is fun, if you like debugging. Having software with support for OpenCL is even more fun, because no debugging is needed. But what would be a good machine? Below is an overview of what kind of hardware you have to think about; it is not in-depth, but gives you enough information to make a decision in your local or online computer store.

    Companies who want to build a cluster, contact us for information. Professional clusters need different hardware than described here.

    Continue reading “OpenCL mini buying guide for X86”

    GPU-developers, work for StreamHPC and friends

    Bored at work? Go start working for one of the anti-boring GPU-expert companies: StreamHPC (Netherlands, EU), Appilo (Israel) or AccelerEyes (Georgia, US).

    We all look for people who know how to code GPUs. Experience is key, so you need to have at least one year of experience in GPU-programming.

    Submit your CV now and don’t give up on your dream leaving behind the boring times.

    StreamHPC

    Amsterdam-based StreamHPC is a young startup with a typical Dutch, open atmosphere (gezellig). Projects are always different from the previous one, as we work in various fields. Most work is in parallelisation and OpenCL coding. We are quite demanding about your knowledge of the various hardware platforms OpenCL works on (GPUs, mobile GPUs, array-processors, DSPs, FPGAs).

    See the jobs-page for more information and how to apply.

    Appilo

    Appilo, based in northern Israel, is seeking GPU-programmers for project-based contracts. Depending on the project, most of the work can usually be performed remotely.

    Use the mail on the contact-page at Appilo to send your CV.

    AccelerEyes

    Atlanta-based AccelerEyes delivers products which are used to accelerate C, C++, and Fortran codes on CUDA GPUs and OpenCL devices. They look for people who believe in their products and want to help make them better.

    See their jobs-page for more information and how to apply.

    OpenCL hardware test centre @ StreamHPC

    S10000
    AMD FirePro S10000

    Want to test your OpenCL software on specific hardware? From now on, that is possible by logging in to StreamHPC’s test servers.

    Update: new prices. Got feedback it should compete with Amazon EC2.

    During the beta-period (February to April) we have available:

    • Dual FirePro S10000 (total ~11.8 TFLOPS single precision, ~2.9 TFLOPS double precision).
    • Several embedded GPU-boards attached.

    By request more servers will be added, but for now we start lean.

    You’ll get:

    • 64 bit Linux server.
    • A secure SSH access with chrooted harddisk space.
    • Full NDA in case StreamHPC staff needs to assist.
    • Both shared and dedicated usage possible.
    • Assistance via mail and skype.

    The cost for shared hours is €1,- per hour; for private hours €5,- per hour (reduced from €10,-). Discounts are available for academics and existing customers (also from past trainings) – just contact us. The goal of the shared hours is to set up the software and do basic tests, not to run intensive benchmarks.

    We hope you can now make a better decision what hardware to choose, or to have your paper’s conclusions based on more recent hardware.

    Want more info or have special requests? Call +31854865760 (office) or +31645400456 (cell), or use the contact form.

    Meteorology & Climatology

    Whether it is short-term weather forecasting or long-term climate change, prediction and modelling in meteorology and climatology generally happen on supercomputers. StreamHPC focuses on preparing algorithms for execution on GPUs and other accelerator hardware (e.g., Xeon Phi), which make up the smallest elements in the distributed compute architecture of supercomputers. Contact us to discuss what we can do for you.

    Mobile Processor OpenCL drivers (Q3 2013) + rating

    For your convenience: an overview of all ARM GPUs and their driver availability. Please let me know if something is missing.

    I’ve added a rating, to friendly-push the vendors to get to at least a 7. Vendors can contact me if they think the rating does not reflect reality.

    ZiiLabs

    SDK-page@StreamHPC

    Drivers can be delivered by Creative, when you pledge to order ZMS-40 processors. Mail us for a contact at Creative. Minimum order size is unknown.

    These processors can therefore only be used in custom devices.

    Rating: 4/10

    Vivante

    SDK-page@StreamHPC

    They are found on public devices. Android-drivers that work on FreeScale processors are openly available and can be found here.

    Rating: 8/10

    Even though the processors are not that powerful, Vivante/FreeScale offers the best support.

    Qualcomm

    SDK-page@StreamHPC

    Drivers are not shipped on devices, according to various sources. Android drivers are in the SDK though, which can be found here.

    Rating: 7/10

    Rating will go up when drivers are publicly shipped on phones/tablets.

    ARM MALI

    Samsung SDK-page@StreamHPC

    There are lots of problems around the drivers for Exynos, which only seem to work on the Arndale board when the LCD is also ordered. Android drivers can be downloaded here.

    Rating: 5/10

    It all comes down to execution – half-baked drivers don’t do it. It is unclear whom to blame, but it certainly has had influence on the creation of a new version of the Exynos 5, the Octa.

    Imagination Technologies

    SDK-page@StreamHPC

    TI only delivers drivers under NDA. Samsung has one board coming up with OpenCL 1.1 EP drivers.

    Rating: 5/10

    Rating will go up when drivers from TI become available without obstacles, or Samsung delivers what they failed to do with the previous Exynos 5.

    Exciting times coming up

    Mostly because of a power struggle between Google and the GPU vendors, there is some hesitation to ship OpenCL drivers on phones and tablets. Unfortunately, Google’s answer to OpenCL, RenderScript Compute, does not provide what developers want. Google’s official answer is that it does not want fragmentation, nor code that is optimised for a certain GPU. The interpreted answer is that Google wants vendor lock-in and therefore blocks the standard. Whatever the reason, OpenCL is being used as a sword to show teeth over who has a say about the future of Android – only the advertisement company Google, or also the group of processor makers and the various phone/tablet vendors?

    In H2 2014 Nvidia will ship CUDA drivers with their Tegra 5 GPUs, completing the soap opera.

    There are rumours Apple will intervene and make OpenCL available on iOS. This would explain why Imagination and Qualcomm put so much effort into showing OpenCL results.

    And always keep a close watch on POCL, the vendor-independent OpenCL implementation.

    Need a programmer for any of the above devices? Hire us!