GPU-related PHD positions at Eindhoven University and Twente University

Posted by Vincent Hindriksen on 12 June 2019

We’re collaborating with a few universities on formal verification of GPU code. The project is called ChEOPS: verified Construction of corrEct and Optimised Parallel Software.

We’d like to put the following PhD position to your attention:

Eindhoven University of Technology is seeking two PhD students to work on the ChEOPS project, a collaborative project between the universities of Twente and Eindhoven, funded by the Open Technology Programme of the NWO Applied and Engineering Sciences (TTW) domain.

In the ChEOPS project, research is conducted to make the development and maintenance of software aimed at graphics processing units (GPUs) more insightful and effective in terms of functional correctness and performance. GPUs have an increasingly big impact on industry and academia, due to their great computational capabilities. However, in practice, one usually needs to have expert knowledge on GPU architectures to optimally gain advantage of those capabilities.

Continue reading →

OpenCL hardware test centre @ StreamHPC

Posted by Vincent Hindriksen on 18 February 2014

Wanting to test your OpenCL-software on specific hardware? From now on, that is possible by logging in on StreamHPCs test-servers.

Update: new prices. Got feedback it should compete with Amazon EC2.

During the beta-period (February to April) we have available:

[list1]

Dual FirePro S10000 (total ~11.8 TFLOPS single precision, ~2.9 TFLOPS double precision).
~~Several embedded GPU-boards attached.~~

[/list1]

By request more servers will be added, but for now we start lean.

You’ll get:

[list1]

64 bit Linux server.
A secure SSH access with chrooted harddisk space.
Full NDA in case StreamHPC staff needs to assist.
Both shared and dedicated usage possible.
Assistance via mail and skype.

[/list1]

The costs for shared hours are €1,- per hour, for private hours ~~€10,-~~ €5,- per hour. Discounts are available for academics and existing customers (also from past trainings) – just contact us. Goal of the shared hours is to setup the software, do basic tests, but not to do intensive benchmarks.

We hope you can now make a better decision what hardware to choose, or to have your paper’s conclusions based on more recent hardware.

Want more info or have special requests? Call +31854865760 (office) or +31645400456 (cell), or use the contact form.

ImageJ and OpenCL

Posted by Vincent Hindriksen on 30 January 2011 with 5 Comments

For a customer I’m writing a plugin for ImageJ, a toolkit for image-processing and analysis in Java. Rick Lentz has written an OpenCL-plugin using JOCL. In the tutorial step 1 is installing the great OS Ubuntu, but that would not be the fastest way to get it going, and since JOCL is multi-platform this step should be skippable. Furthermore I rewrote most of the code, so it is a little more convenient to use.

In this blog-post I’ll explain how to get it up and running within 10 minutes with the provided information.

Continue reading “ImageJ and OpenCL” →

Assessment of existing code-base’s quality

You found the main computation takes over 90% of the processing time, or you found the framework to be slow in general. We got in contact and discussed speeding up your software, after which we shared this assessment with you. This assessment should give you insights if the software-project is ready to be shared with Stream HPC.

The larger the code-base, the more important its code-quality

The first step is to prepare code for porting or optimization. As it’s not always easy to know what to do, we’ve defined 3 levels of code quality. The higher the quality-level, the fewer obstacles for the project and the lower the costs, the less communication is required and the fewer frustrations.

Preparing a project for porting / optimizing

The sections below discuss the 3 levels where a project can be. The goal is that you do a self-assessment and write down answers for each question with details, not only provide the final answer.

The code needs to have all levels marked in red: full level 1 and the high level of level 2. It does not matter if the existing code is written in Matlab, Python, C, C++, Assembly, OpenCL, CUDA or any other language.

The action points of all levels need to be done. When a project is not ready yet, we can assist in improving code-quality, but assume a lot has to be done by your team. This will be a separate (pre-)project, and the full estimation for the porting/optimization can only be done after that.

If not possible to level up the software or no source files are available, it will be handled as a black box project or R&D project. Do know that such projects can never be done fixed priced and are always unique. The generic part of the process is described in this blog – we are experienced in doing such focused R&D projects.

Level 1: Understandability

Goal: Can the software be understood without help from the main developers?

High level

Are the algorithms explained in i.e. scientific papers?
Do alternate implementations exist? I.e. in Python or Matlab.
Optional: is there a presentation/overview on the algorithm?

Mid level

Is there a software design document?
Is test-data provided that can be used to run the code?

Low level

Are all functions documented?
Is it clear what each (part of the) function is doing?
Are all variables documented?
Is it clear what each variable means?

Action points

A good communication-plan to get questions answered quickly. This includes a direct contact and regular calls.
Walk the code and improve/update the existing in-code documentation. Often it’s not looked at since it was written.

Level 2: Testability

Goal: Are there few bugs and can new bugs be easily detected?

High level

Is there a golden standard of the standard, such like a proof-of-concept or the existing code? Is it available to us?
Are the outputs deterministic? Or can they be made deterministic quickly?
Is there a good understanding where the algorithm is less stable to unstable, and can this be explained?
Is there clarity on the required maximum quantitative errors that would define the correctness of output? Are there high level test-cases for all of these?
Is there clarity on required maximum qualitative errors that would define the correctness of output? Is the collection of examples and counter-examples large enough to define the error in full?
Can the whole library tested using the sample input and output?
Is there clarity on the required precision? When is an answer correct?

Mid Level

Does the CI contain static code analysis?
Are there test-cases or automated tools for finding qualitative errors?

Low level

Are the compute-intensive functions covered by functional tests?
Are the most important functions covered by functional tests?
Are the other functions covered by functional tests?

Action points

Making the code deterministic by temporary removing random number generator.
Define what is correct output in detail.
Complete the test cases for the different types of errors.
Creating several sets of data-in and its correct data-out. This will be used for acceptance of ported code, giving the maximum errors allowed.
Decide which sub-results need to be compared.

Level 3: Quality

Goal: Is the software easy to maintain and extend?

High level

Is there build-automaton, like Cmake or Meson?
Is the code multi-OS?
Are tests run with every commit, or at least daily/nightly?

Mid level

Are functions defined to only do one thing and do it well?
Nobody of the team labelled the code as “Spaghetti code”?
Are function names self-explanatory?
Are variable-names self-explanatory?

Low level

There is no duplicated code?
There is no code that actually can be made much less complex?
There are no functions or methods longer than 100 lines excluding documentation?
There are no large classes?
There are no functions or methods with far more than 7 parameters?
There is only few commented out code?
There is only few dead code? This is code that is not being called from anywhere.
There is no code that should be rewritten?
There is no variables that are being reused?
There is no significant dependence on global state?

Action points

Prepare the (new) GPU-code for continuous integration.
Document how the ported code maps to the existing code.

Costs Assessment

Code at level 0 can take 10 – 20 times more time than code at level 1. How much exactly is difficult to say, and that’s exactly the reason we require a minimal level. The bitter pill is that the costs of not cleaning up the code is often even higher due to increased hardware costs, increased maintenance costs and lack of innovation.

There is only one reason to keep code quality under level 1: when it’s going to be replaced within a month.

When level 1 is mostly done, each missing item can take 20% to 40% of extra costs. A good example is not having good test-cases or no CPU-code available or not even an executable. This adds costs in both the implementation phase and the acceptance phase. Making a new CPU-implementation first is often cheaper.

From the described minimal level (level 1 in full, level 2 the high level) costs are more predictable and less costly. When getting a quote, these can be requested to be mentioned and you can choose to do it yourself or let us do it.

Contact

Just discuss your goals, after done an assessment. We can guide you in prioritization of making your code ready for porting.

Email to info@streamhpc.com or look to the other contact options on the contact page here. Then we’ll schedule a call and discuss what you need.

Copyright

If you find this self-assessment useful, know it took us a long time to improve this document and a lot of experience is hidden in it. But as we find quality software very important, we’re releasing this list under CC-BY-NC-ND: it cannot be altered or used commercially, and it must have a clear reference to Stream HPC as the authors. In other words: it must be clear that you did not do the research or writing of this assessment yourself.

If you have feedback or suggestions, we’d really like to hear from you!

AMD positions FirePro S10000 against both TESLA K10 (and K20)

Posted by Vincent Hindriksen on 15 November 2012

During the “little” HPC-show, SC12, several vendors have launched some very impressive products. Question is who steals the show from whom? Intel got their Phi-processor finally launched, NVIDIA came with the TESLA K20 plus K20X, and AMD introduced the FirePro S10000.

This card is the fastest card out there with 5.91 TFLOPS of processing power – much faster than the TESLA K20X, which only does 3.95 TFLOPS. But comparing a dual-GPU to a single-GPU card is not always fair. The moment you choose to have more than one GPU (several GPUs in one case or a small cluster), the S10000 can be fully compared to the Tesla K20 and K20X.

The S10000 can be seen as a dual-GPU version of the S90000, but does not fully add up. Most obvious is the big difference in power-usage (325 Watt) and the active cooling. As server-cases are made for 225 Watt cooling-power, this is seen as a potential possible disadvantage. But AMD has clearly looked around – for GPUs not 1U-cases are used, but 3U-servers using the full width to stack several GPUs.

Continue reading “AMD positions FirePro S10000 against both TESLA K10 (and K20)” →

We ported GROMACS from CUDA to OpenCL

Posted by Vincent Hindriksen on 1 November 2014 with 9 Comments

GROMACS does soft matter simulations on molecular scale. Let it fly.

GROMACS is an important molecular simulation kit, which can do all kinds of “soft matter” simulations like nanotubes, polymer chemistry, zeolites, adsorption studies, proteins, etc. It is being used by researches worldwide and is one of the bigger bio-informatics softwares around.

To speed up the computations, GPUs can be used. The big problem is that only NVIDIA GPU could be used, as CUDA was used. To make it possible to use other accelerators, we ported it to OpenCL. It took several months with a small team to get to the alpha-release, and now I’m happy to present it to you.

For who knows us from consultancy (and training) only, might have noticed. This is our first product!

We promised to keep it under the same open source license and that effectively means we are giving it away for free. Below I’ll explain how to obtain the sources and how to build it, but first I’d like to explain why we did it pro bono.

Why we did it

Indeed, we did not get any money (income or funds) for this. There have been several reasons, of which the below four are the most important.

The first reason is that we want to show what we can. Each project was under NDA and we could not demo anything we made for a customer. We chose for a CUDA package to port to OpenCL, as we notice that there is a trend to port CUDA-software to OpenCL (i.e. Adobe software).
The second reason is that bio-informatics is an interesting industry, where we would like to do more work.
Third reason is that we can find new employees. Joining the project is a way to get noticed and could end up in a job-proposal. The GROMACS project is big and needs unique background knowledge, so it can easily overwhelm people. This makes it perfect software to test out who is smart enough to handle such complexity.
Fourth is gaining experience with handling open source projects and distributed teams.

Therefore I think it’s a very good investment, while giving something (back) to the community.

Presentation of lessons learned during SC14

We just jumped in and went for it. We learned a lot, because it did not go as we expected. All this experience, we would like to share on SuperComputing 2014.

During SC14 I will give a presentation on the OpenCL port of GROMACS and the lessons learned. As AMD was quite happy with this port, they provided me a place to talk about the project:

“Porting GROMACS to OpenCL. Lessons learned”
SC14, New Orleans, AMD’s mini-theatre.
19 November, 15:00 (3:00 pm), 25 minutes

The SC14 demo will be available on the AMD booth the whole week, so if you’re curious and want to see it live with explanation.

If you’d like to talk in person, please send an email to make an appointment for SC14.

Getting the sources and build

It still has rough edges, so a better description would be “we are currently porting GROMACS to OpenCL”, but we’re very close.

As it is work in progress, no binaries are available. So besides knowledge of C, C++ and Cmake, you also need to know how to work with GIT. It builds on both Windows and Linux, and NVIDIA and AMD GPUs are the target platforms for the current phase.

The project is waiting for you on https://github.com/StreamHPC/gromacs.

The wiki has lots of information, from how to build, supported devices to the project planning. Please RTFM, before starting! If something is missing on the wiki, please let us know by simply reporting a new issue.

Help us with the GROMACS OpenCL port

We would like to invite you to join, so we can make the port better than the original. There are several reasons to join:

Improve your OpenCL skills. What really applies to the project is this quote:

Tell me and I forget.
Teach me and I remember.
Involve me and I learn.
Make the OpenCL ecosphere better. Every product that has OpenCL support, gives choice to the user what GPU to use (NVIDIA, AMD or Intel)
Make GROMACS better. It is already a large community and OpenCL-knowledge is needed now.
Get hired by StreamHPC. You’ll be working with us directly, so you’ll get to know our team.

What can you do? There is much you can do. Once you managed to build and run it, look at the bug reports. First focus is to get the failing kernels working – this is top priority to finalise phase 1. After that, the real fun begins in phase 2: add features and optimise for speed on specific devices. Since AMD FirePro is much better in double precision than Nvidia Tesla, it would be interesting to add support for double precision. Also certain parts of the code is done on the CPU, which have real potential to be ported to the GPU.

If things are not clear and obstruct you from starting, don’t get stressed and send an email with any question you have. We’re awaiting your merge request or issue report!

Special thanks

This project wasn’t possible without the help of many people. I’d like to thank them now.

The GROMACS team in Sweden, from the KTH Royal Institute of Technology.
- Szilárd Páll. A highly skilled GPU engineer and PhD student, who pro-actively keeps helping us.
- Mark Abraham. The GROMACS development manager, always quickly answering our various questions and helping us where he could.
- Berk Hess. Who helped answering the harder questions and feeding the discussions.
Anca Hamuraru, the team lead. Works at StreamHPC since June, and helped structure the project with much enthusiasm.
Dimitrios Karkoulis. Has been volunteering on the project since the start in his free time. So special thanks to Dimitrios!
Teemu Virolainen. Works at StreamHPC since October and has shown to be an expert on low-level optimisations.
Our contacts at AMD, for helping us tackle several obstacles. Special thanks go to Benjamin Coquelle, who checked out the project to reproduce problems.
Michael Papili, for helping us with designing a demo for SC14.
Octavian Fulger from Romanian gaming-site wasd.ro, for providing us with hardware for evaluation.

Without these people, the OpenCL port would never been here. Thank you.

Khronos OpenCL presentation at SIGGRAPH 2010

Posted by Vincent Hindriksen on 9 September 2010 with 1 Comment

Here you find the videos uploaded by Khronos of their presentation about OpenCL. I added the time-line, so you can scroll to the more interesting parts easily. The presentation by Ofer Rosenberg of Intel and Cliff Woolly of NVIDIA were not uploaded (yet). Please note that for non-American people the speech of Affi Munchie is hard to hear; luckily his sheets explain most.

http://www.youtube.com/watch?v=BdZFtcQ2LYw

For the first two presentations the sheets can be downloaded from the Khronos-website. The time-line has the sheet-numbers mentioned.

0:00 [sheet 1] Presentation by the president of Khronos and chair of the session: Neill Trevett of NVIDIA.
0:06 [sheet 2] Welcome and a quick overview
1:12 [sheet 3] The prizes for the attendees (not us, online viewers)
1:40 [4] Overview of all members of Khronos. Khronos does not only take care of OpenCL but also the more famous OpenGL and projects like Collada.
2:26 [5] Processor Parallelism. CPUs are getting more parallel and GPUs more programmable. The overlapping area is called Heteregenous Computing and there is where OpenCL pops up.
3:10 [6] OpenCL timeline from version 1.0 to 1.1.
4:44 [7] OpenCL workinggroup with only 30 logos. He mentions missing logos like the one from Apple.
5:18 [8] The Visual Computing Ecosystem, where OpenCL interoperability with other standards are shown. The talk is not complete, so I don;t know if he talks about DirectX.

Continue reading “Khronos OpenCL presentation at SIGGRAPH 2010” →

Jobs

We have jobs for people who get bored easily.

DSC_0046_small — *We kept space for you*

In return, we offer a solution to boredom, since performance engineering is hard.

As the market demand for affordable high performance software grows, StreamHPC continuously looks for great and humble people to join our team(s). With upcoming products, new markets like OpenCL on low-power ARM processors and compute-intensive applications like AI and VR, we expect continuous growth.

We currently have two types of jobs:

Software Performance Engineers. The heroes who make software go vroom vroom
Growth and Support. The heroes who make the people and the company go vroom vroom

For the second group, you will find the most changes in the job-ads, as we seek people who want to join us for the years to come.

Find our jobs and Apply

To apply, go to our recruitment website.

The hiring process

The procedure for the technical roles is as follows:

You send your CV, some public code (preferably C/C++/CUDA/HIP/OpenCL) and your motivational letter/email.
We do a quick scan of your CV and letter. (In most cases you’ll get feedback within 2 to 5 days)
For those who are left, you do a simple online test. This is to get a grasp of your way of working and thinking, and to prepare you for the longer test. (25 minutes max)
You will have a (video) talk with HR (30-45 minutes)
After that, you are invited for a longer online test. You show your skills on C/C++ and algorithms. Be warned this includes the ridiculous puzzles, simply because we actually use those ridiculous things (takes 2 – 3 hours)
You’ll get a technical interview on C++ and GPGPU (2 hours)
We’ll send you a conditional job-offer, assuming that the rest will be ok.
We now go into the long interview to be absolutely sure we are a fit, and to introduce each other in more detail (takes 3 hours)
We check your references. If these check out, you’re hired.

We try to minimize the time it takes you, while also giving you enough chances to proof yourself. As we are sometimes flooded with applications, we filter out on simple things like “Did not mention CUDA, OpenCL, SYCL, GLSL, HLSL, etc” – be sure you add such experience to your CV or cover letter, also when at university or a personal Github project.

Want to know more about the process, how to prepare, etc? Read “our job-application process“.

For growth&support-roles, it is close to the above but without the coding test.

For management-roles, we do test for technical expertise, as we cannot permit loosening this know-how when project-related.

So where are you waiting for? Apply for a job!

Xeon Phi Knights Corner compatible workstation motherboards

Posted by Vincent Hindriksen on 1 August 2015 with 6 Comments

Intel has assumed a lot if it comes to XeonPhi’s. One was that you will use it on dual-Xeon servers or workstations and that you already have a professional supplier of motherboards and other computer-parts. We can only guess why they’re not supporting non-professional enthusiasts who got the cheap XeonPhi.

After browsing half the internet to find an overview of motherboards, I eventually emailed Gigabyte, Asus and ASrock for more information for a desktop-motherboard that supports the blue thing. With the information I got, I could populate the below list. Like usual we share our findings with you.

Quote that applies here: “The main reason business grade computer supplies can be sold at a higher price is that the customers don’t know what they’re buying“. When I heard, I did not know why the customer is not well-informed – now I do. Continue reading “Xeon Phi Knights Corner compatible workstation motherboards” →

Starting with GROMACS and OpenCL

Posted by Vincent Hindriksen on 15 November 2014

Now that GROMACS has been ported to OpenCL, we would like you to help us to make it better. Why? It is very important we get more projects ported to OpenCL, to get more critical mass. If we only used our spare resources, we can port one project per year. So the deal is, that we do the heavy lifting and with your help get all the last issues covered. Understand we did the port using our own resources, as everybody was waiting for others to take a big step forward.

The below steps will take no more than 30 minutes.

Getting the sources

All sources are available on Github (our working branch, bases on GROMACS 5.0). If you want to help, checkout via git (on the command-line, via Visual Studio (included in 2013, 2010 and 2012 via git plugin), Eclipse or your preferred IDE. Else you can simply download the zip-file. Note there is also a wiki, where most of this text came from. Especially check the “known limitations“. To checkout via git, use:

git clone git@github.com:StreamHPC/gromacs.git

Building

You need a fully working building environment (GCC, Visual Studio), and an OpenCL SDK installed. You also need FFTW. Gromacs installer can build it for you, but it is also in Linux repositories, or can be downloaded here for Windows. Below is for Linux, without your own FFTW installed (read on for more options and explanation):

mkdir build
cd build
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DCMAKE_BUILD_TYPE=Release

There are several other options, to build. You don’t need them, but it gives an idea what is possible:

-DCMAKE_C_COMPILER=xxx equal to the name of the C99 compiler you wish to use (or the environment variable CC)
-DCMAKE_CXX_COMPILER=xxx equal to the name of the C++98 compiler you wish to use (or the environment variable CXX)
-DGMX_MPI=on to build using an MPI wrapper compiler. Needed for multi-GPU.
-DGMX_SIMD=xxx to specify the level of SIMD support of the node on which mdrun will run
-DGMX_BUILD_MDRUN_ONLY=on to build only the mdrun binary, e.g. for compute cluster back-end nodes
-DGMX_DOUBLE=on to run GROMACS in double precision (slower, and not normally useful)
-DCMAKE_PREFIX_PATH=xxx to add a non-standard location for CMake to search for libraries
-DCMAKE_INSTALL_PREFIX=xxx to install GROMACS to a non-standard location (default /usr/local/gromacs)
-DBUILD_SHARED_LIBS=off to turn off the building of shared libraries
-DGMX_FFT_LIBRARY=xxx to select whether to use fftw, mkl or fftpack libraries for FFT support
-DCMAKE_BUILD_TYPE=Debug to build GROMACS in debug mode

It’s very important you use the options GMX_GPU and GMX_USE_OPENCL.

If the OpenCL files cannot be found, you could try to specify them (and let us know, so we can fix this), for example:

cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DCMAKE_BUILD_TYPE=Release \
  -DOPENCL_INCLUDE_DIR=/usr/include/CL/ -DOPENCL_LIBRARY=/usr/lib/libOpenCL.so

Then make and optionally check the installation (success currently not guaranteed). For make you can use the option “-j X” to launch X threads. Below is with 4 threads (4 core CPU):

make -j 4

If you only want to experiment, and not code, you can install it system-wide:

sudo make install
source /usr/local/gromacs/bin/GMXRC

In case you want to uninstall, that’s easy. Run this from the build-directory:

sudo make uninstall

Building on Windows, special settings and problem solving

See this article on the Gromacs website. In all cases, it is very important you turn on GMX_GPU and GMX_USE_OPENCL. Also the wiki of the Gromacs OpenCL project has lots of extra information. Be sure to check them, if you want to do more than just the below benchmarks.

Run & Benchmark

Let’s torture GPUs! You need to do a few preparations first.

Preparations

Gromacs needs to know where to find the OpenCL kernels, for both Linux and Windows. Under Linux type: export GMX_OCL_FILE_PATH=/path-to-gromacs/src/. For Windows define GMX_OCL_FILE_PATH environment variable and set its value to be /path_to_gromacs/src/

Important: if you plan to make changes to the kernels, you need to disable the caching in order to be sure you will be using the modified kernels: set GMX_OCL_NOGENCACHE and for NVIDIA also CUDA_CACHE_DISABLE:

export GMX_OCL_NOGENCACHE
export CUDA_CACHE_DISABLE

Simple benchmark, CPU-limited (d.poly-ch2)

Then download archive “gmxbench-3.0.tar.gz” from ftp://ftp.gromacs.org/pub/benchmarks. Unpack it in the build/bin folder. If you have installed it machine wide, you can pick any directory you want. You are now ready to run from /path-to-gromacs/build/bin/ :

cd d.poly-ch2
../gmx grompp
../gmx mdrun

Now you just ran Gromacs and got results like:

Writing final coordinates.

           Core t (s)   Wall t (s)      (%)
 Time:        602.616      326.506    184.6
             (ns/day)   (hour/ns)
Performance:    1.323      18.136

Get impressed by the GPU (adh_cubic_vsites)

This experiment is called “NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water”. Download “ADH_bench_systems.tar.gz” from ftp://ftp.gromacs.org/pub/benchmarks. Unpack it in build/bin.

cd adh_cubic_vsites
../gmx grompp -f pme_verlet_vsites.mdp
../gmx mdrun

If you want to run from the first GPU only, add “-gpu_id 0” as a parameter of mdrun. This is handy if you want to benchmark a specific GPU.

What’s next to do?

If you have your own experiments, ofcourse test them on your AMD devices. Let us know how they perform on “adh_cubic_vsites”! Understand that Gromacs was optimised for NVidia hardware, and we needed to reverse a lot of specific optimisations for good performance on AMD.

We welcome you to solve or report an issue. We are now working on optimisations, which are the most interesting tasks of a porting job. All feedback and help is really appreciated. Do you have any question? Just ask them in the comments below, and we’ll help you on your way.

Taking on OpenCL

Posted by Vincent Hindriksen on 6 August 2012

Nothing worth having comes easy - "Dr. Kelso" — Quote by Dr. Kelso (from the series “Scrubs”) – click for video

OpenCL is getting more and more important and for more developers a skill worth having. At StreamHPC we saw this coming in 2010 and have been training people in OpenCL since. A few weeks ago I got a question on how to take on OpenCL, which could be interesting for more people: how to take on OpenCL. In other words: the steps to take to learn OpenCL the quickest. Since the last time I wrote on learning OpenCL is almost two years ago, it is a good time to share more recent insights on this matter.

Taking on OpenCL takes four main steps in this order:

Understanding the hardware and architectures.
Thinking both in parallel and in vectors.
Learning the OpenCL language itself.
Profiling and debugging.

You see that is a whole difference from learning for instance Java with a Pascal-background. Learning VHDL for programming FPGAs comes closer, though you don’t need to tinker with timings when doing OpenCL. Let’s go through the steps.

Continue reading “Taking on OpenCL” →

Big announcements: SYCL 1.2, WebCL 1.0 and OpenCL 2.0

Posted by Vincent Hindriksen on 19 March 2014

Khronos just announced three OpenCL based releases:

SYCL 1.2 Provisional Spec – Abstraction Layer for Leveraging C++ and OpenCL
WebCL 1.0 Final Spec – JavaScript bindings to OpenCL
OpenCL 2.0 Adopters Program – Conformance for OpenCL 2.0 implementations

Below I’ve quoted the summaries. For each of these I’ve prepared articles, but due to lack of time haven’t been able to finish and publish them. So for now some remarks after the summaries.

Khronos Releases SYCL 1.2 Provisional Specification

Programming abstraction layer to enable applications and high-level frameworks to leverage C++ and OpenCL for heterogeneous parallel acceleration

March 19, 2014 – San Francisco, Game Developer’s Conference – The Khronos™ Group today announced the release of SYCL™ 1.2 as a provisional specification to enable community feedback. SYCL is a royalty-free, cross-platform abstraction layer that enables the development of applications and frameworks that build on the underlying concepts, portability and efficiency of OpenCL™, while adding the ease-of-use and flexibility of C++. For example, SYCL can provide single source development where C++ template functions can contain both host and device code to construct complex algorithms that use OpenCL acceleration – and then enable re-use of those templates throughout the source code of an application to operate on different types of data.

https://www.khronos.org/news/press/khronos-releases-sycl-1.2-provisional-specification

Higher level languages are very important, as OpenCL is simply too low-level. SYCL is another effort to help researching & improving this area, as we haven’t found the holy grail. Languages like C++AMP and RenderScript claim they can replace OpenCL, but we all know that some implementations of those languages have been done on top of OpenCL.

Khronos Releases WebCL 1.0 Specification

JavaScript bindings to OpenCL brings heterogeneous parallel computing to Web browsers

March 19, 2014 – San Francisco, Game Developer’s Conference – The Khronos™ Group today announced the ratification and public release of the WebCL™ 1.0 specification. Developed in close cooperation with the Web community, WebCL extends the capabilities of HTML5 browsers by enabling developers to offload computationally intensive processing to available computational resources such as multicore CPUs and GPUs. WebCL defines JavaScript bindings to OpenCL™ APIs that enable Web applications to compile OpenCL C kernels and manage their parallel execution. Like WebGL™, WebCL is expected to enable a rich ecosystem of JavaScript middleware that provides access to accelerated functionality to a wide diversity of Web developers.

https://www.khronos.org/news/press/khronos-releases-webcl-1.0-specification

WebCL gets more and more attention, even before it was even official. It would be interesting to see the same growth to higher level language as we have with OpenCL now. for this reason we started the Learning WebCL website, to help you learn WebCL in the future.

Khronos Launches OpenCL 2.0 Adopters Program

Conformance tests now available to certify OpenCL 2.0 implementations

March 19, 2014 – San Francisco, Game Developer’s Conference – The Khronos™ Group today announced the availability of the official conformance test suite for the OpenCL 2.0 specification, making it possible for implementers to certify that their implementations are officially conformant thorough the Khronos OpenCL Adopters Program. Khronos has also released a set of header files for OpenCL 2.0 and an updated specification with a number of clarifications and corrections to the specification first released in November 2013.

https://www.khronos.org/news/press/khronos-launches-opencl-2.0-adopters-program

Finally the headers are open. Stay tuned for an extensive OpenCL 1.2 vs OpenCL 2.0 comparison, which I have prepared but were unable to finish without the header files.

I hope you are as happy with these announcements as I am. This tells me that OpenCL is ready for real business.

Freescale / Vivante

Vivante got into the news with OpenCL, when winning in the automotive-industry from NVIDIA. Reason: the car-industry wanted an open standard. They have support for:

OpenCL
OpenGL
Google RenderScript

OpenCL

See Vivante’s GPGPU-page for more info, where below table is taken from.

	GC800 Series	GC1000 Series	GC2000 Series	GC4000 Series	GC5000 Series	GC6000 Series
Clock Speed MHz	600 – 800	600 – 800	600 – 800	600 – 800	600 – 800	600 – 800
Compute Cores	1	1	1	2	4	Up to 8
Shader Cores	1 (Vec-4) 4 (Vec-1)	2 / 4 (Vec-4) 8 / 16 (Vec-1)	4 (Vec-4) 16 (Vec-1)	8 (Vec-4) 32 (Vec-1)	8 (Vec-4) 32 (Vec-1)	16 (Vec-4) 64 (Vec-1)
Shader GFLOPS	6–8 (High) 12–16 (Medium)	11–30 (High) 22–60 (Medium)	22–30 (High) 44–60 (Medium)	44–60 (High) 88–120 (Medium)	44–60 (High) 88–120 (Medium)	88–118 (High) 176–236 (Medium)
GPGPU Options	Embedded Profile	Embedded Profile	Embedded / Full Profile	Embedded / Full Profile	Embedded / Full Profile	Embedded / Full Profile
Cache Coherent	Yes	Yes	Yes	Yes	Yes	Yes

One big advantage Vivante claims to have over the competition, is the GFLOPS/mm². This could be of advantage to win the 1TFLOPS-war over their competition (which they’ve entered). The upcoming GC4000 series can push around 48GFLOPS, leaving the 1TFLOPS to the GC6000 series.

Their GPUs are sold as IP to CPU-makers, so they don’t sell their own chips. Vivante has created the GPU-drivers, but you have to contact the chip-maker to obtain them.

Freescale i.MX6

The i.MX6 Quad (4 ARM Cortex-A9 cores) and i.MX6 Dual (2 ARM Cortex-A9 cores) have support for OpenCL 1.1 EP (Embedded Profile). (source)

Both have a Vivante GC2000 GPU, which has 16 GFLOPS to 24 GFLOPS depending on the source. The GPU cores can be used to run OpenGL 2.0 ES shaders and OpenCL EP kernels.

Board: SABRE

There are several boards available. Freescale suggests to use the SABRE Board for Smart Devices ($399).

This Linux board support package including OpenCL drivers and general BSP documentation is available for free download on the product page under Software & Tools tab.

Other Boards

Alternative evaluation boards from 3^rd parties can be found by searching on internet for “i.MX6Q board” as there are many! For instance the Wandboard (i.MX6Q, $139, shown at the left – was tipped that the Dual is actually a Duallite and thus not have support!!).

OpenCL-driver found on the Wandboard Ubuntu-image – download clinfo-output here (gcc clinfo.c -lOpenCL -lGAL).

Drivers & SDK

Under the Software & Tools tab of the SABRE-board there are drivers – they have not been tested with other boards, so no guarantees are given.

Most information is given in Get started with OpenCL on Qualcomm i.MX6.

IMX6_GPU_SDK: a collection of GPU code samples, for OpenCL the work is still in progress. You can find it under “Software Development Tools” -> “Snippets, Boot Code, Headers, Monitors, etc.”
IMX_6D_Q_VIVANTE_VDK_<version>_TOOLS: GPU profiling tools, offline compiler and an emulator with CL support which runs on Windows platforms. Be sure you pick the latest version! You can find it under “Software Development Tools” -> “IDE – Debug, Compile and Build Tools“.

More info

Check out the i.MX-community. You can also contact us for more info.

To see a great application with 4 i.MX6 Quad boards using OpenCL, see this “Using i.MX6Q to Build a Palm-Sized Heterogeneous Mini-HPC” project.

Vivante GPU (Freescale i.MX6)

Vivante got into the news with OpenCL, when winning in the automotive-industry from NVIDIA. Reason: the car-industry wanted the open standard OpenCL. We at StreamHPC were very glad with this recognition from the automotive-industry.

See their GPGPU-page for more info, where below table is taken from.

	GC800 Series	GC1000 Series	GC2000 Series	GC4000 Series	GC5000 Series	GC6000 Series
Clock Speed MHz	600 – 800	600 – 800	600 – 800	600 – 800	600 – 800	600 – 800
Compute Cores	1	1	1	2	4	Up to 8
Shader Cores	1 (Vec-4) 4 (Vec-1)	2 / 4 (Vec-4) 8 / 16 (Vec-1)	4 (Vec-4) 16 (Vec-1)	8 (Vec-4) 32 (Vec-1)	8 (Vec-4) 32 (Vec-1)	16 (Vec-4) 64 (Vec-1)
Shader GFLOPS	6–8 (High) 12–16 (Medium)	11–30 (High) 22–60 (Medium)	22–30 (High) 44–60 (Medium)	44–60 (High) 88–120 (Medium)	44–60 (High) 88–120 (Medium)	88–118 (High) 176–236 (Medium)
GPGPU Options	Embedded Profile	Embedded Profile	Embedded / Full Profile	Embedded / Full Profile	Embedded / Full Profile	Embedded / Full Profile
Cache Coherent	Yes	Yes	Yes	Yes	Yes	Yes

Their GPUs are sold as IP to CPU-makers, so they don’t sell their own chips. Vivante has created the GPU-drivers, but you have to contact the chip-maker to obtain them.

Freescale i.MX6

The i.MX6 Quad (4 ARM Cortex-A9 cores) and i.MX6 Dual (2 ARM Cortex-A9 cores) have support for OpenCL 1.1 EP (Embedded Profile). (source)

Both have a Vivante GC2000 GPU, which has 16 GFLOPS to 24 GFLOPS depending on the source. The GPU cores can be used to run OpenGL 2.0 ES shaders and OpenCL EP kernels.

Board: SABRE

There are several boards available. Freescale suggests to use the SABRE Board for Smart Devices ($399).

This Linux board support package including OpenCL drivers and general BSP documentation is available for free download on the product page under Software & Tools tab.

Other Boards

OpenCL-driver found on the Wandboard Ubuntu-image – download clinfo-output here (gcc clinfo.c -lOpenCL -lGAL).

Drivers & SDK

Under the Software & Tools tab of the SABRE-board there are drivers – they have not been tested with other boards, so no guarantees are given.

Most information is given in Get started with OpenCL on Qualcomm i.MX6.

IMX6_GPU_SDK: a collection of GPU code samples, for OpenCL the work is still in progress. You can find it under “Software Development Tools” -> “Snippets, Boot Code, Headers, Monitors, etc.”
IMX_6D_Q_VIVANTE_VDK_<version>_TOOLS: GPU profiling tools, offline compiler and an emulator with CL support which runs on Windows platforms. Be sure you pick the latest version! You can find it under “Software Development Tools” -> “IDE – Debug, Compile and Build Tools“.

More info

Check out the i.MX-community. You can also contact us for more info.

To see a great application with 4 i.MX6 Quad boards using OpenCL, see this “Using i.MX6Q to Build a Palm-Sized Heterogeneous Mini-HPC” project.

The Fastest Payroll System Of The World

Posted by Vincent Hindriksen on 9 August 2023

At StreamHPC we do several very different types of projects, but this project has been very, very different. In the first place, it was nowhere close to scientific simulation or media processing. Our client, Intersoft solutions, asked us to speed up thousands of payroll calculations on a GPU.

They wanted to solve a simple problem, avoiding slow conversations with HR of large companies:

Yes, I can answer your questions.

For that I need to do a test-run.

Please come back tomorrow.

The calculation of 1600 payslips took one hour. This means 10,000 employees would take over 6 hours. Potential customers appreciated the clear advantages of Intersoft’s solution, but told that they were searching for a faster solution in the first place.

Using our accelerated compute engine, a run with 3300 employees (anonymised, real data) now only takes 20 seconds, including loading and writing all data to the database – a speedup of about 250 times. Calculations with 100k employees can get all calculations done under 2 minutes – the above HR department would have liked that.

Continue reading →

Event: Embedded boards comparison

Posted by Vincent Hindriksen on 11 June 2015

A 2012 board — One of the first OpenCL enabled boards from 2012.

Date: 17 September 2015, 17:00
Location: Naritaweg 12B, Amsterdam
Costs: free

Selecting the right hardware for your OpenCL-powerd product is very important. We therefore organise a three hour open house where we you can test, benchmark and discuss many available chipsets that support OpenCL. For each showcased board you can read and hear about the advantages, disadvantages and preferred types of algorithms.

Board with the following chipsets will be showcased:

ARM Mali
Imagination PowerVR
Qualcomm Snapdragon
NVidia Tegra
Freescale i.MX6 / Vivante
Adapteva

Several demo’s and benchmarks are prepared that will continuously run on each board. We will walk around to answer your questions.

During the evening drinks and snacks are available.

Test your own code

There is time to test OpenCL code for free in our labs. Please get in contact, as time that evening is limited.

Registration

Register by filling in the below form. Mention with how many people you will come, if you come by car and if you want to run your own code.

[contact_form]

Updated: OpenCL and CUDA programming training – now online

Posted by Vincent Hindriksen on 27 January 2020

Update: due to Corona, the Amsterdam training has been cancelled. We’ll offer the training online on dates that better suit the participants.

As it has been very busy here, we have not done public trainings for a long time. This year we’re going to train future GPU-developers again – online. For now it’s one date, but we’ll add more dates in this blog-post later on.

If you need to learn solid GPU programming, this is the training you should attend. The concepts can be applied to other GPU-languages too, which makes it a good investment for any probable future where GPUs exist.

This is a public training, which means there are attendees from various companies. If you prefer not to be in a public class, get in contact to learn more about our in-company trainings.

It includes:

Four days of training online
Free code-review after the training, to get feedback on what you created with the new knowledge;
1 month of limited support, so you can avoid StackOverflow;
Certificate.

Trainings will be done by employees of Stream HPC, who all have a lot of experience with applying the techniques you are going to learn.

Schedule

Most trainings have around 40% lectures, 50% lab-sessions and 10% discussions.

Continue reading →

The magic of clGetKernelWorkGroupInfo

Posted by Vincent Hindriksen on 22 October 2015

It’s not easy to get the available private memory size – actually it’s impossible to get this information directly from the device/drivers, using the OpenCL API. This can only be explained after you dive deep into clGetKernelWorkGroupInfo – the function that tells you how well your kernel fits on the device. It is strange this function is not often discussed.

Memory sizes

CL_KERNEL_LOCAL_MEM_SIZE

Returns the amount of local memory, in bytes, being used by a kernel (per work-group). Use CL_DEVICE_LOCAL_MEM_SIZE to find out the maximum.

CL_KERNEL_PRIVATE_MEM_SIZE

Returns the minimum amount of private memory, in bytes, used by each work-item in the kernel.

Work sizes

CL_KERNEL_GLOBAL_WORK_SIZE

This answers the question “What is the maximum value for global_work_size argument that can be given to clEnqueueNDRangeKernel?”. The result is of type size_t[3].

CL_KERNEL_WORK_GROUP_SIZE

The is the same for local_work_size. The kernel’s resource requirements (register usage etc.) are used, to determine what this work-group size should be.

CL_KERNEL_COMPILE_WORK_GROUP_SIZE

If __attribute__((reqd_work_group_size(X, Y, Z))) is used, then (X, Y, Z) is returned, else (0, 0, 0).

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

It returns a performance-hint: if the total number of work-items is a multiple of this number, then you’ll get good results. So no more remembering 32 or 64 for specific GPUs, but simply kick in a call to this function.

Combined with clDeviceInfo’s CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, you can fine-tune your workgroup-size in case you need the group-size to be as large as possible.

ARM MALI

[infobox type=”information”]

Need a MALI programmer? Hire us!

[/infobox]

ARM takes OpenCL serious and has various developer boards and drivers for their MALI GPU. Most notable Samsung has their Exynos chips, but now also Rockchip brings high-end MALI GPUs.

Drivers and SDK

ARM MALI Linux SDK

The SDK can be downloaded here. Developers manual is here. A 14-page FAQ with lots of answers on your questions is here.

For compilation on Ubuntu g++-arm-linux-gnueabi is needed. Also remove the “-none” in platform.mk. Then compilation will result in a libOpenCL.so.

Drivers for Android

Software available for the Arndale can be found here – drivers (including graphics drivers with OpenCL-support) are here. The current state is that their OpenCL drivers are sometimes working, most times not – we are very sorry for that and try to find fixes.

We did not test the drivers with other devices than the Arndale with the same chipset (such as the new Chromebook and the Nexus 10).

Drivers for Linux

For the Samsung Chromebook drivers are available here. For Arndale these drivers should work too (not tested yet), if you use the kernel-drivers of the same version.

Firefly RK3288

The Firefly has:

RK3288 Cortex-A17 quad core@ 1.8GHz
Mali-T764 GPU with support for OpenGL ES 1.1/2.0 /3.0, OpenVG1.1, OpenCL, Directx11

Drivers have not yet been tested with this board yet! Cheap alternatives can be found here.

Exynos 5 Boards

The following boards are available:

Arndale 5250-A
~~ARMBRIX Zero~~
YICsystem YSE5250
YICsystem SMDKC520
Nexus 10
Samsung Chromebook

Scroll down for more info on these boards.

Board 1: Arndale 5250-A

The board is fully loaded and can be extended with touch-screen, SSD, Wifi+sound and camera. Below is an image with the soundboard and connectivity-board attached.

Working boards using OpenCL on display (click on image to see the Twitter-status):

Here are a few characteristics:

Cortex-A15 @ 1.7 GHz dual core
128 bit SIMD NEON
2GB 800MHz DDR3 (32 bit)

More information can be found on the wiki and forum.

Order information

For more information and to order, go to http://howchip.com/shop/item.php?it_id=AND5250A. For an overview of extensions, go to: http://howchip.com/shop/content.php?co_id=ArndaleBoard_en. The price is $250 for the board, $50 for shipping to Europe, and extension-boards start at $60. You need a VAT-number to get it through the customs, but you have to pay EU-VAT anyway.

Currently you need to order the LCD too, as the latest proprietary drivers (which includes OpenCL) does not work with HDMI. There are (vague) solutions, to be found on the forums.

Be sure you can buy a good 5V adapter (the + at the pin). A minimum of 3A is required for the board (TDP of the whole board is 11 to 12 Watt). Adapter costs around $25,- in the store, or you can buy them online for $7,50. You also need a serial cable – there are USB2COM-cables under €20,-. If you are in doubt, buy the $60,- package with cables (no COM2USB), adapter and microSD card.

Board 2: ARMBRIX Zero cancelled

This board has the same processor, but lacks the extension-boards and wifi/bluetooth. It does have 2GB 800MHz DDR3, various USB-ports, SATA(!) and Ethernet RJ45.

Order information

~~For more information and to order, go to http://www.howchip.com/shop/item.php?it_id=BRIX5250A.~~ The price is $145 for the board, $50 for shipping to Europe. You need a VAT-number to get it through the customs.

Make sure you can buy a good adapter (the + at the pin). The minimum for the Arndale is 3A, but I am not sure this board draws less. Adapter costs around $25,- in the store, or you can buy them online for $7,50.

Board 3: YICsystem YSE5250

This board has 2GB DDR3L(32bit 800MHz) and 8GB eMMC (onboard memory-card), USB 3.0 and LAN. Optional boards are for audio, WIFI, sensors (Gyroscope, Accelerometer, Magnetic, Light & Proximity), 5MB camera, LCD and GPS.

Currently it is unknown if OpenCL-drivers will be delivered, and there is no mention of it on their site.

Below you’ll find the layout of the board.

Order information

You can order at http://www.yicsystem.com/products/low-cost-board/yse5250/. The complete board costs $245,-. Costs for shipping and import are unknown.

Board 4: YICsystem SMDKC520

The SMDKC520 board is the offical reference board of Samsung Exynos5250 System. Currently it is unknown if OpenCL-drivers will be delivered, but as chances are high I already put it here.

It is like the YSE5250, but it seems it includes WIFI, camera and LCD – though the webpage is very vague. Once I have more info on the YSE5250, I’ll continue on getting more infor on this board.

Price is unknown, but does not fall under “budget boards”.

Order information

You can send an enquiry at http://www.yicsystem.com/products/smdk-board/smdk-c520/. Remember that OpenCL-support is currently unknown!

Board 5: Google Nexus 10

OpenCL-drivers have been found pre-installed on this tablet, so with some tinkering you can run openCL rightaway.

It is a complete tablet, so no case-modding is needed. It has 2GB RAM, WIFI, 16 or 32GB eMMC, 5MP camera, 10″ WXGA LCD, all sensors, NFC, sound, etc. For all the specs, see this page.

Order information

You can order the Nexus 10 not in all countries, as google has restricted sales channels. See http://www.google.com/nexus/10/ for more info on ordering. With some creativity you can find ways to order this tablet into countries not selected by Google. Price is $400 or €400.

Board 6: Samsung Chromebook

For €300 a complete laptop that runs Linux and has OpenGL ES and OpenCL 1.1 drivers? That makes it a great OpenCL “board”.

See ARM’s Chromebook dev-page for more information on how to get Linux running with OpenCL and OpenGL.

The drivers are brand new – when we’ve tested it, we’ll add more information on this page.

ARM

ARM is most known for their CPU architectures, but they also have a GPU architecture, MALI. Their devices support:

OpenCL
OpenGL
Vulkan
RenderScript

OpenCL

ARM takes OpenCL serious and has various developer boards and drivers for their MALI GPU. Most notable Samsung has their Exynos chips, but now also Rockchip brings high-end MALI GPUs.

Drivers and SDK

ARM MALI Linux SDK

The SDK can be downloaded here. Developers manual is here. A 14-page FAQ with lots of answers on your questions is here.

For compilation on Ubuntu g++-arm-linux-gnueabi is needed. Also remove the “-none” in platform.mk. Then compilation will result in a libOpenCL.so.

Drivers for Android

We did not test the drivers with other devices than the Arndale with the same chipset (such as the new Chromebook and the Nexus 10).

Drivers for Linux

For the Samsung Chromebook drivers are available here. For Arndale these drivers should work too (not tested yet), if you use the kernel-drivers of the same version.

Firefly RK3288

The Firefly has:

RK3288 Cortex-A17 quad core@ 1.8GHz
Mali-T764 GPU with support for OpenGL ES 1.1/2.0 /3.0, OpenVG1.1, OpenCL, Directx11

Drivers have not yet been tested with this board yet! Cheap alternatives can be found here.

Exynos 5 Boards

The following boards are available:

Arndale 5250-A
YICsystem YSE5250
YICsystem SMDKC520
Nexus 10
Samsung Chromebook

Scroll down for more info on these boards.

Board 1: Arndale 5250-A

The board is fully loaded and can be extended with touch-screen, SSD, Wifi+sound and camera. Below is an image with the soundboard and connectivity-board attached.

Working boards using OpenCL on display (click on image to see the Twitter-status):

Here are a few characteristics:

Cortex-A15 @ 1.7 GHz dual core
128 bit SIMD NEON
2GB 800MHz DDR3 (32 bit)

More information can be found on the wiki and forum.

Order information

Currently you need to order the LCD too, as the latest proprietary drivers (which includes OpenCL) does not work with HDMI. There are (vague) solutions, to be found on the forums.

Board 2: YICsystem YSE5250

Currently it is unknown if OpenCL-drivers will be delivered, and there is no mention of it on their site.

Below you’ll find the layout of the board.

Order information

You can order at http://www.yicsystem.com/products/low-cost-board/yse5250/. The complete board costs $245,-. Costs for shipping and import are unknown.

Board 3: YICsystem SMDKC520

The SMDKC520 board is the offical reference board of Samsung Exynos5250 System. Currently it is unknown if OpenCL-drivers will be delivered, but as chances are high I already put it here.

It is like the YSE5250, but it seems it includes WIFI, camera and LCD – though the webpage is very vague. Once I have more info on the YSE5250, I’ll continue on getting more infor on this board.

Price is unknown, but does not fall under “budget boards”.

Order information

You can send an enquiry at http://www.yicsystem.com/products/smdk-board/smdk-c520/. Remember that OpenCL-support is currently unknown!

Board 5: Google Nexus 10

OpenCL-drivers have been found pre-installed on this tablet, so with some tinkering you can run openCL rightaway.

It is a complete tablet, so no case-modding is needed. It has 2GB RAM, WIFI, 16 or 32GB eMMC, 5MP camera, 10″ WXGA LCD, all sensors, NFC, sound, etc. For all the specs, see this page.

Order information

Board 6: Samsung Chromebook

For €300 a complete laptop that runs Linux and has OpenGL ES and OpenCL 1.1 drivers? That makes it a great OpenCL “board”.

See ARM’s Chromebook dev-page for more information on how to get Linux running with OpenCL and OpenGL.

The drivers are brand new – when we’ve tested it, we’ll add more information on this page.

Using Qt Creator for OpenCL

Posted by Vincent Hindriksen on 25 May 2010 with 2 Comments

More and more ways are getting available to bring easy OpenCL to you. Most of the convenience libraries are wrappers for other languages, so it seems that C and C++ programmers have the hardest time. Since a while my favourite way to go is Qt: it is multi-platform, has a good IDE, is very extensive, has good multi-core and OpenGL-support and… has an extension for OpenCL: ~~http://labs.trolltech.com/blogs/2010/04/07/using-opencl-with-qt~~ http://blog.qt.digia.com/blog/2010/04/07/using-opencl-with-qt/

Other multi-platform choices are Anjuta, CodeLite, Netbeans and Eclipse. I will discuss them later, but wanted to give Qt an advantage because it also simplifies your OpenCL-development. While it is great for learning OpenCL-concepts, please know that the the commercial version of Qt Creator costs at least €2995,- a year. I must also warn the plugin is still in beta.

streamhpc.com is not affiliated with Qt.

Getting it all

Qt Creator is available in most Linux-repositories: install packages ‘qtcreator’ and ‘qt4-qmake’. For Windows, MAC and the other Linux-distributions there are installers available: http://qt.nokia.com/downloads. People who are not familiar with Qt, really should take a look around on http://qt.nokia.com/.

You can get the source for the plugin QtOpenCL, by using GIT:

git clone http://git.gitorious.org/qt-labs/opencl.git QtOpenCL

See http://qt.gitorious.org/qt-labs/opencl for more information about the status of the project.

You can download it here: https://dl.dropbox.com/u/1118267/QtOpenCL_20110117.zip (version 17 January 2011)

Building the plugin

For Linux and MAC you need to have the ‘build-essentials’. For Windows it might be a lot harder, since you need make, gcc and a lot of other build-tools which are not easily packaged for the Windows-OS. If you’ve made a win32-binary and/or a Windows-specific how-to, let me know.

You might have seen that people have problems building the plugin. The trick is to use the options -qmake and -I (capital i) with the configure-script:

./configure -qmake <location of qmake 4.6 or higher> -I<location of directory CL with OpenCL-headers>

make

Notice the spaces. The program qmake is provided by Qt (package ‘qt4-qmake’), the OpenCL-headers by the SDK of ATI or NVidia (you’ll need the SDK anyway), or by Khronos. By example, on my laptop (NVIDIA, Ubuntu 32bit, with Qt 4.7):

./configure -qmake /usr/bin/qmake-qt4 -I/opt/NVIDIA_GPU_Computing_SDK_3.2/OpenCL/common/inc/

make

This should work. On MAC the directory is not CL, but OpenCL – I haven’t tested it if Qt took that into account.

After building , test it by setting a environment-setting “LD_LIBRARY_PATH” to the lib-directory in the plugin, and run the provided example-app ‘clinfo’. By example, on Linux:

export LD_LIBRARY_PATH=`pwd`/lib:$LD_LIBRARY_PATH

cd util/clinfo/

./clinfo

This should give you information about your OpenCL-setup. If you need further help, please go to the Qt forums.

Configuring Qt Creator

Now it’s time to make a new project with support for OpenCL. This has to be done in two steps.

First make a project and edit the .pro-file by adding the following:

LIBS += -L<location of opencl-plugin>/lib -L<location of OpenCL-SDK libraries> -lOpenCL -lQtOpenCL

INCLUDEPATH += <location of opencl-plugin>/lib/

<location of OpenCL-SDK include-files>

<location of opencl-plugin>/src/opencl/

By example:

LIBS += -L/opt/qt-opencl/lib -L/usr/local/cuda/lib -lOpenCL -lQtOpenCL

INCLUDEPATH += /opt/qt-opencl/lib/

/usr/local/cuda/include/

/opt/qt-opencl/src/opencl/

The following screenshot shows how it could look like:

Second we edit (or add) the LD_LIBRARY_PATH in the project-settings (click on ‘Projects’ as seen in screenshot):

/usr/lib/qtcreator:location of opencl-plugin>:<location of OpenCL-SDK libraries>:

By example:

/usr/lib/qtcreator:/opt/qt-opencl/lib:/usr/local/cuda/lib:

As you see, we now also need to have the Qt-creator-libraries and SDK-libraries included.

The following screenshot shows the edit-field for the project-environment:

Testing your setup

Just add something from the clinfo-source to your project:

printf("OpenCL Platforms:n"); 
QList platforms = QCLPlatform::platforms();
foreach (QCLPlatform platform, platforms) { 
   printf("    Platform ID       : %ldn", long(platform.platformId())); 
   printf("    Profile           : %sn", platform.profile().toLatin1().constData()); 
   printf("    Version           : %sn", platform.version().toLatin1().constData()); 
   printf("    Name              : %sn", platform.name().toLatin1().constData()); 
   printf("    Vendor            : %sn", platform.vendor().toLatin1().constData()); 
   printf("    Extension Suffix  : %sn", platform.extensionSuffix().toLatin1().constData());  
   printf("    Extensions        :n");
} QStringList extns = platform.extensions(); 
foreach (QString ext, extns) printf("        %sn", ext.toLatin1().constData()); printf("n");

If it gives errors during programming (underlined includes, etc), focus on INCLUDEPATH in the project-file. If it complaints when building the application, focus on LIBS. If it complaints when running the successfully built application, focus on LD_LIBRARY_PATH.

Ok, it is maybe not that easy to get it running, but I promise it gets easier after this. Check out our Hello World, the provided examples and http://doc.qt.nokia.com/opencl-snapshot/ to start building.

Preparing a project for porting / optimizing

Level 1: Understandability

High level

Mid level

Low level

Action points

Level 2: Testability

High level

Mid Level

Low level

Action points

Level 3: Quality

High level

Mid level

Low level

Action points

Costs Assessment

Contact

Copyright

Why we did it

Presentation of lessons learned during SC14

Getting the sources and build

Help us with the GROMACS OpenCL port

Special thanks

We have jobs for people who get bored easily.

Find our jobs and Apply

The hiring process

The below steps will take no more than 30 minutes.

Getting the sources

Building

Building on Windows, special settings and problem solving

Run & Benchmark

Preparations

Simple benchmark, CPU-limited (d.poly-ch2)

Get impressed by the GPU (adh_cubic_vsites)

What’s next to do?

Khronos Releases SYCL 1.2 Provisional Specification

Khronos Releases WebCL 1.0 Specification

Khronos Launches OpenCL 2.0 Adopters Program

OpenCL

Freescale i.MX6

Board: SABRE

Other Boards

Drivers & SDK

More info

Freescale i.MX6

Board: SABRE

Other Boards

Drivers & SDK

More info

Test your own code

Registration

Schedule

Memory sizes

CL_KERNEL_LOCAL_MEM_SIZE

CL_KERNEL_PRIVATE_MEM_SIZE

Work sizes

CL_KERNEL_GLOBAL_WORK_SIZE

CL_KERNEL_WORK_GROUP_SIZE

CL_KERNEL_COMPILE_WORK_GROUP_SIZE

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

Read more?

Need a MALI programmer? Hire us!

Drivers and SDK

ARM MALI Linux SDK

Drivers for Android

Drivers for Linux

Firefly RK3288

Exynos 5 Boards

Board 1: Arndale 5250-A

Order information

Board 2: ARMBRIX Zero cancelled

Order information

Board 3: YICsystem YSE5250

Order information

Board 4: YICsystem SMDKC520

Order information

Board 5: Google Nexus 10

Order information

Board 6: Samsung Chromebook