DirectCompute’s unpopularity

In the world of GPGPU we currently have four players: Khronos' OpenCL, NVIDIA's CUDA, Microsoft's DirectCompute and PathScale's ENZO. You probably know CUDA and OpenCL already (or start reading more articles from this blog). ENZO is a 64-bit compiler serving a small niche market, and DirectCompute is built on top of CUDA/OpenCL, or at least uses the same drivers.

Edit 2011-01-03: I was contacted by PathScale about my conclusions about ENZO. The reason there is not much out there is that they’re still in closed alpha. Expect to hear more from them about ENZO somewhere in the coming 3 months.

A while ago David Kanter wrote an article introducing OpenCL, in which he claimed on page 4 that DirectCompute will win out over CUDA. I quote:

Judging by history though, OpenCL and DirectCompute will eventually come to dominate the landscape, just as OpenGL and DirectX became the standards for graphics.

I tweeted that I totally disagreed with him, and in this article I will explain why I think that.

Continue reading “DirectCompute’s unpopularity”

StreamComputing is 2 years old! A personal story.

More than two years ago, on 13 January 2010, I wrote my first blog post. Four months later StreamComputing (redacted: rebranded to StreamHPC in 2017) was both official and unknown. I want to share with you my personal story on how I came to start up this company.

The push-factor

I wanted to create a company that was about innovative projects – something I had hardly encountered until then. In the years before, I had programmed parts of what I call A-to-B flows: software that is at its core quite simple, but tediously discussed as if it were very, very complex.

“Complex” software

As you can see, the complexity is not in the software. It is in undocumented APIs, forgotten knowledge, knowledge in the heads of unknown people, bossy and demanding people who kindly ask for last-minute architecture changes, deadlines around promotion rounds, new deadlines due to board decisions, people being afraid of getting replaced once the software is finished, jealousy if another team makes version 2 of the software, etc. The rule of office software is therefore understandable:

Software is either unfinished,
or turned into a platform for unintended functionality.

The fun in office software is reserved for the analyst, architect or manager – the developer just puts in his earphones and makes all the requested changes (hooray for services like Spotify). But as I did not want to become a manager and wished to keep improving my development skills, I had to conclude I was on the wrong track.

Continue reading “StreamComputing is 2 years old! A personal story.”

Cancelled: StreamHPC at Mosaic3DX in Cambridge, UK

Update: we are very sorry to tell you that, due to a deadline in a project, we were forced to cancel Vincent’s talk.

StreamHPC will be at Mosaic3DX in Cambridge, UK, on 30+31 October. The brand new conference managed to get big names on board, and I’m happy to be among them. Mosaic3DX describes itself as:

an international event comprising a conference, an exhibition, and opportunities for networking. Our intended audience are users as well as developers of Imaging, Visualisation, and 3D Digital Graphics systems. This includes researchers in Science and Engineering subjects, Digital Artists, as well as Software Developers in different industries.

Continue reading “Cancelled: StreamHPC at Mosaic3DX in Cambridge, UK”

Dutch: Free knowledge morning on the new generation of processors

Heard somewhere that graphics cards can nowadays be used for heavy computations? Picked up something about vector processors complementing scalar processors during a coffee chat? Then it is time to get an overview of the big changes in the processor field, so you can steer your organisation better when it comes to innovation.

See https://streamhpc.com/education/gratis-kennislunch/ for an hour of explanation on location.

Who is this knowledge morning for?

Companies for which speed is important and which have to process large amounts of data. For example computing centres, R&D departments, financial institutions, developers of medical software, algorithm developers and vision companies. Investors with high-tech companies in their portfolio can also be brought up to date on the current developments free of charge.

You do not need a technical background, but you will not be bored if you speak bits & bytes. We do ask you to indicate your background, so we can add the right details to the programme.

What is the programme?

In the first hour you will hear how the current processor market has changed compared to a few years ago, and which new software-development methods have been introduced. After that you will get an overview of the new solutions that are available and how they relate to the existing ones. This gives you enough insight to determine whether they are applicable within your company. The hour closes with what StreamHPC can do for you, but also what you can do yourself.

In the second hour we discuss a few use cases and there is time for questions. Which use cases are discussed depends on the backgrounds of the attendees; think of Monte Carlo, physics, enzyme reactions, matrix computations and neural networks.

When?

Once there are at least 10 registrations, a date will be picked.

If there is immediate interest within your company, StreamHPC can also visit you to give this presentation, tailored to your background. Please contact us for that.

Learn about AMD’s PRNG library we developed: rocRAND – includes benchmarks

While CUDA kept its dominance over OpenCL, AMD introduced HIP – a programming language that closely resembles CUDA. Porting code to AMD hardware no longer takes months: more and more CUDA software is converted to HIP without problems. Even really large and complex code bases take a few weeks at most, and we found that the problems solved along the way often made the CUDA code run faster as well.

The only problem is that each CUDA library needs a HIP equivalent before all CUDA software can be ported.

Here is where we come in. We helped AMD build a high-performance Pseudo Random Number Generator (PRNG) library, called rocRAND. Random number generation is important in many fields, from finance (Monte Carlo simulations) to cryptography, and from procedural generation in games to providing white noise. For some applications any stream of numbers is good enough, but for large simulations the PRNG is the limiting factor. Continue reading “Learn about AMD’s PRNG library we developed: rocRAND – includes benchmarks”

Windows on ARM

In 2010 Microsoft got interested in ARM, because of low-power solutions for server farms. ARM had lobbied for years to convince Microsoft to port Windows to their architecture, and now the result is there. Let’s not dwell on the past and why they did not do it earlier, depending completely on Intel, AMD/ATI and NVIDIA. NB: This article is a personal opinion, meant to open up the conversation about Windows plus OpenCL.

While Google and Apple have taken their share of the ARM-OS market, Microsoft wants some too. A wise choice, but again late. We’ve seen how the Windows PC market was targeted first from the cloud (run services in the browser on any platform) and by Apple’s user-friendly eye-candy (a personal computer had to be distinguished from a dull working machine), then from smartphones and tablets (many users want e-mail and a browser, not to sit behind their desk). Microsoft’s responses were Azure (cloud, Q1 2010), Windows 7 (OS with a slick user interface, Q3 2009), Windows Phone 7 (smartphones, Q4 2010) and now Windows 8 (OS for X86 PCs and ARM tablets, 2012 or later).

Windows 8 for ARM will be made with assistance from NVIDIA, Qualcomm and Texas Instruments, according to their press-release [1]. They even demonstrated a beta of Windows 8 running Microsoft Office on ARM-hardware, so it is not just a promise.

How can Microsoft support this new platform, and (more important for StreamHPC) what will the consequences be for OpenCL, CUDA and DirectCompute?

Continue reading “Windows on ARM”

How we work

Phases

Since integrating OpenCL into existing software is not straightforward, we work in four phases, each of which has its own schedule and benefits.

1: exploration

In initial meetings, we (StreamHPC) and you (our customer) discuss the bottlenecks in the software. If we believe the bottleneck is related to computation speed, we continue with the next phase. Otherwise, we give suggestions on how to target the problem in another way; our expertise can be used to do measurements as a start for an alternative project.

If the speed of computations is the bottleneck and the theoretical minimal speed-up is promising, a project team is formed.

2: research

In this step we will further investigate the possible methods of acceleration and visualize the current software-design. We make a plan in cooperation with the software architect to establish if and how OpenCL hardware acceleration can be integrated into the existing software. Costs and time depend on the (pre-)work done by your software team.

Although nothing is produced yet, this is a very important phase. Progress is shared with the project team every day, so the project’s needs can be discussed and the design adjusted daily.

All information needed to finish the project is gathered by the end of this phase. This makes the subsequent go/no-go decision an easy one.

3: initial acceleration

A: isolated

Here we will target the theoretical minimal speed-up. Depending on the problem and the budget for this phase, this can be done in mathematics-software or in pure OpenCL.

B: integration

First, the hardware is chosen to run the OpenCL-code, then the code from the first part will be integrated into the software.

The speed-up will largely depend on the software it is built into and the hardware it is built on.

4: optimisation

Using standard OpenCL techniques gives a baseline increase in computation speed. By applying hardware-specific optimisations and extra hardware, we can achieve extra speed-ups. If we expect the gain to be marginal, we suggest skipping this phase.

Since hardware changes every year, we can make a 5-year plan for building up a cluster that increases in speed every year, for a budget comparable to building one large cluster at once – a cluster which would be comparatively slow after 5 years.

SLA and risk-management

We are happy to serve your needs, but since we are not a large company we provide guarantees in a different way:

  • We can include a fall-back in case the code does not work as expected. The control is in the hands of the operator. Automatic fall-back is also possible when the input does not match certain pre-conditions. Note that not all software is ready for fall-back logic.
  • Since GPGPU is rather new, many small OpenCL-companies throughout the world are in contact in case extra man-power is needed. StreamHPC is part of that network.
  • We take documentation very seriously; all our code is documented extensively, so any other company could take over. This is realistic, since OpenCL kernels are rather small.

Please contact us for more information and details.

Engineering World 2011: OpenCL in the Cloud

[English] At Sogeti Engineering World 2011 I gave a presentation about OpenCL in the cloud, in Dutch. To increase the relative cool-factor I made sure I had the only Prezi presentation among the standard slide-deck presentations. You can see the result below, but it cannot possibly tell the whole story I shared during the talk. If you want to know more, just ask or leave a comment under this article. I listen to my readers via Twitter.

The presentation has four parts: an introduction, an explanation of OpenCL, mobile devices and data centres. The last two together form the cloud-computing segment I want to focus on.

Continue reading “Engineering World 2011: OpenCL in the Cloud”

All the members of the OpenCL working group 2013

The list below shows the members of the OpenCL working group as of November 2013.


We can expect small changes each year, but this is close to the current state. I need the rest of Q4 to finalise all the info – any help is appreciated.

This list was also compiled in 2010, and you can see several differences. If a company has an SDK available, there is a link. That is a big difference from the previous list – this one is much more concrete. Continue reading “All the members of the OpenCL working group 2013”

Xeon Phi Knights Corner compatible workstation motherboards

Intel has assumed a lot when it comes to Xeon Phis. One assumption was that you would use them in dual-Xeon servers or workstations, and that you already have a professional supplier of motherboards and other computer parts. We can only guess why they’re not supporting non-professional enthusiasts who got the cheap Xeon Phi.

After browsing half the internet to find an overview of motherboards, I eventually emailed Gigabyte, Asus and ASRock for more information on desktop motherboards that support the blue thing. With the information I got, I could populate the list below. As usual, we share our findings with you.

A quote that applies here: “The main reason business-grade computer supplies can be sold at a higher price is that the customers don’t know what they’re buying“. When I first heard it, I did not understand why the customer is not well-informed – now I do. Continue reading “Xeon Phi Knights Corner compatible workstation motherboards”

GPU-related PHD positions at Eindhoven University and Twente University

We’re collaborating with a few universities on formal verification of GPU code. The project is called ChEOPS: verified Construction of corrEct and Optimised Parallel Software.

We’d like to put the following PhD position to your attention:


Eindhoven University of Technology is seeking two PhD students to work on the ChEOPS project, a collaborative project between the universities of Twente and Eindhoven, funded by the Open Technology Programme of the NWO Applied and Engineering Sciences (TTW) domain.

In the ChEOPS project, research is conducted to make the development and maintenance of software aimed at graphics processing units (GPUs) more insightful and effective in terms of functional correctness and performance. GPUs have an increasingly big impact on industry and academia, due to their great computational capabilities. However, in practice, one usually needs expert knowledge of GPU architectures to take full advantage of those capabilities.

Continue reading “GPU-related PHD positions at Eindhoven University and Twente University”

NVIDIA’s answer to FirePro S9000: the TESLA K20

Two months ago I wrote about the FirePro S9000 – AMD’s answer to the K10 – and was already looking forward to this K20. While in the gaming world it hardly matters which card you buy to get good gaming performance, in the compute world it does. AMD presented 3.23 TFLOPS with 6 GB of memory, and now we are going to see what the K20 can do.

The K20 is very different from its predecessor, the K10. The biggest differences are the much higher double-precision capability and the fact that this card gets its power from a single GPU. To keep power usage low, it is clocked at only 705 MHz, compared to the 1 GHz of the K10. I could not find information on the memory bandwidth.

ECC is turned on by default, hence you see it presented as having 5 GB. There is no information yet on whether this card also has ECC on the local memories/caches.

Continue reading “NVIDIA’s answer to FirePro S9000: the TESLA K20”

A typical week

Primary and secondary tasks

The main focus is programming and solving problems. That means that everything that obstructs this focus needs to be gotten out of the way. This is simpler on paper than in reality, and therefore there are multiple “faiths” among companies on how to do this.

We start by clearly distinguishing primary and secondary tasks, where the difference is that, in the long term, more time needs to be spent on the primary tasks. The last part of that sentence is very important.

What we do every day and week:

  • Planning
    • Write issues
    • Make issue estimations
    • Prioritize issues
    • Bundle issues in epics
    • Pick issues for personal weekly milestones
  • Problem-solving
  • Coding and math
  • Learning
    • Reading books
    • Reading papers
    • Watching videos

Why so much emphasis on planning?

The planning part takes a good amount of time, but it keeps us from spending too much time on dead ends; and spending time on dead ends is not a primary task at all. Planning also helps with designing better strategies: there is limited time for solving problems and writing code, so full-scope research is not an option. As there is no way to build complex code efficiently without time estimations for the different approaches, planning skills provide the necessary foundation for becoming a senior coder.

We start training these skills as early as possible, so juniors too are asked to do all planning tasks. Initially this takes a good part of the valuable coding time, but that quickly goes down and the first advantages appear.

Style of project handling

Tools

We mostly use Gitlab and Mattermost to share code and have discussions. This makes it possible to keep good track of each project – searching for what somebody said or coded two years ago is quite easy. Using modern tools has changed the way we work a lot, thus we have questioned and optimized everything that was presented as “good practice”.

We continuously look into new tools that can help us improve. Also here the main focus is to reduce the time on secondary tasks, so we can spend more time thinking on problem-solving.

Pull-style project management

The tasks are written down by the team, using the project-doc as input. All these tasks are put into the task-list of the project and estimated. Then each team member picks the tasks that are a good fit. There are always tasks that need to be pushed instead of pulled, but luckily that’s a relatively small part of all work.

All code (MRs) is checked by one or two colleagues, chosen by the one who wrote the code. Even more important are the discussions in advance, as the group can give more insight than any individual and one can get into the task well-prepared. The goal is not just to get the job finished, but to avoid writing the code in which a future bug will be found.

All types of code can contain comments, and Doxygen can create documentation automatically, so there is no need to copy functions into a Word document. We introduced log-style documentation, as git history and Doxygen don’t answer why a certain decision was made. With such a logbook, a new member of the team can just read these remarks and fully understand why the architecture is the way it is and what its limits are. We’ll discuss this in more detail later.

These types of solutions describe how we work and how we differ from a corporate environment: no-nonsense and effective.

The week

If you worked here, what would your week look like in the first year? We specifically say the first year because, for more complex projects, different approaches could be chosen.

Monday weekly planning

Together with your team you pick up the issues for the week. The issues should have estimations, or these will be done during that meeting. When your week is filled, you know what to do.

Monday weekly meeting

Every Monday we have a weekly meeting to share with everybody how the other projects are doing.

Mon-Fri: Daily standup

Retrospective of the previous day, and tuning of the day ahead.

Practice:

  • Tools
  • C/C++
  • GPGPU
  • Scrum

Friday closing

Weekly retrospective, cleaning up, writing notes on issues, etc.

Weekly customer meetings

Here we discuss the progress and anything blocking. The customer shares their progress, and together problems can be solved.

Many projects have a shared (high-level) issue-list, so the progress is continuously synced with the customer and communication is easy.

Company History

There are not many companies like Stream HPC. Most others are either a government institute running the national supercomputer, one or two freelancers, or… actually not experienced with GPUs. So how did it start? How did we get a large team of HPC and GPU experts, working for customers worldwide?

Want to know more? Go to our contacts page and ask any question.

The 8 reasons why our customers had their code written or accelerated by us

Making software better and faster.

In the past six years we have helped out various customers solve their software performance problems. While each project has been very different, there have been 8 reasons to hire us as performance engineers. These can be categorised in three groups:

  • Reduce processing time
    • Meeting timing requirements
    • Increasing user efficiency
    • Increasing the responsiveness
    • Reducing latency
  • Do more in the same time
    • Increasing simulation/data sizes
    • Adding extra functionality
  • Reduce operational costs
    • Reducing the server count
    • Reducing power usage

Let’s go into each of these. Continue reading “The 8 reasons why our customers had their code written or accelerated by us”

Comparing Syntax for CUDA, OpenCL and HiP

Both CUDA and OpenCL are well-known GPGPU-languages. Unfortunately there are some slight differences between the languages, which are shown below.

You might have heard of HiP, the language that AMD made to support both modern AMD Fiji GPUs and CUDA-devices. CUDA can be (mostly automatically) translated to HiP and from that moment your code also supports AMD high-end devices.

To give an overview of how HiP compares to other APIs, AMD’s Ben Sander made an overview. Below you’ll find the table for CUDA, OpenCL and HiP, slightly altered to be more complete. The languages HC and C++AMP can be found in the original.

Term CUDA OpenCL HiP
Device int deviceId cl_device_id int deviceId
Queue cudaStream_t cl_command_queue hipStream_t
Event cudaEvent_t cl_event hipEvent_t
Memory void * cl_mem void *
Grid of threads grid NDRange grid
Subgroup of threads block work-group block
Thread thread work-item thread
Scheduled execution warp sub-group (warp, wavefront, etc) warp
Thread-index threadIdx.x get_local_id(0) hipThreadIdx_x
Block-index blockIdx.x get_group_id(0) hipBlockIdx_x
Block-dim blockDim.x get_local_size(0) hipBlockDim_x
Grid-dim gridDim.x get_num_groups(0) hipGridDim_x
Device Kernel __global__ __kernel __global__
Device Function __device__ N/A. Implied in device compilation __device__
Host Function __host__ (default) N/A. Implied in host compilation. __host__ (default)
Host + Device Function __host__ __device__ N/A. __host__ __device__
Kernel Launch <<< >>> clEnqueueNDRangeKernel hipLaunchKernel
Global Memory __global__ __global __global__
Group Memory __shared__ __local __shared__
Private Memory (default) __private (default)
Constant __constant__ __constant __constant__
Thread Synchronisation __syncthreads barrier(CLK_LOCAL_MEM_FENCE) __syncthreads
Atomic Builtins atomicAdd atomic_add atomicAdd
Precise Math cos(f) cos(f) cos(f)
Fast Math __cosf(f) native_cos(f) __cosf(f)
Vector float4 float4 float4
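To make the table concrete, here is the same vector-add kernel in each dialect. These are illustrative sketches, not complete host programs; the HiP launch shown uses the early hipLaunchKernel form, which later HIP versions renamed:

```c
/* CUDA and HiP: the kernel source is identical;
   HiP also accepts hipThreadIdx_x, hipBlockIdx_x, hipBlockDim_x. */
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}
/* CUDA launch: vec_add<<<blocks, threads>>>(a, b, c, n);                          */
/* HiP launch:  hipLaunchKernel(vec_add, dim3(blocks), dim3(threads), 0, 0,
                                a, b, c, n);                                       */

/* OpenCL: the same kernel, launched from the host via clEnqueueNDRangeKernel */
__kernel void vec_add(__global const float *a, __global const float *b,
                      __global float *c, int n)
{
    int i = get_global_id(0);
    if (i < n)
        c[i] = a[i] + b[i];
}
```

Note how CUDA/HiP compute the global index from block and thread indices, while OpenCL's get_global_id(0) returns it directly.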

You see that HiP borrowed from CUDA.

The discussion is of course whether such similar APIs shouldn’t use the same wording. The best outcome would be a mix of the best of each: CUDA’s “shared” is much clearer than OpenCL’s “local”, while OpenCL’s functions for locations and dimensions (get_global_id(0) and such), on the other hand, are often more appreciated than what CUDA offers. CUDA’s “<<< >>>” breaks all C/C++ compilers, making it very hard to build a frontend or IDE plugin.

I hope you found the above useful to better understand the differences between CUDA and OpenCL, but also to see how HiP comes into the picture.

nVidia’s CUDA vs OpenCL Marketing

Please read this article about Microsoft and OpenCL before reading on. The requested benchmarking was done by the people of Unigine, who have some results on the differences between the three big GPGPU APIs: part I, part II and part III, with some typical results.
The following read is not about the technical differences between the 3 APIs, but about the reasons why alternative APIs are kept maintained while OpenCL does the trick. Please comment if you think OpenCL doesn’t do the trick.

As was described in that article, it is important for big companies (such as Microsoft and nVidia) to defend their market share. An open standard like OpenCL does not provide that protection. As it went with OpenGL – which was pushed aside by DirectX to gain market share for Windows – nVidia now does the same with CUDA. First we will sum up the many ways nVidia markets Cuda, and then we discuss the situation.

The way nVidia wanted to play the game was clear early on: it wanted CUDA to be seen as the better alternative to OpenCL. And it is doing a very good job.

The best way to gain acceptance is giving away free information, such as articles and courses – see Cuda’s university courses as an example. Sponsoring helps a lot too, so the first mainstream-oriented books about GPGPU discussed Cuda first, and interesting conferences were always supported by Cuda’s owner. Furthermore, loads of money is put into an expensive website with the most complete information about GPGPU there is to find. nVidia does give a choice by also implementing OpenCL in its drivers – it just does not have big pages on how to learn OpenCL.

AMD – having spent its money on buying ATI – could not put this much effort into the “war” and had to go for OpenCL. You have to know that AMD currently has faster graphics cards at lower prices than nVidia; so based on that alone, they could become the winner in GPGPU (if that were the only thing that mattered). Intel saw the light too late and is even postponing its high-end GPU, Larrabee. The only help for them is that Apple demands OpenCL in nVidia’s drivers – but for how long? Apple does not want a strict dependency on nVidia, since it also has a mobile market, for instance. But what if all Apple developers create their applications in CUDA?

Most developers – helped by the money and support of nVidia – see that there is just little difference between Cuda and OpenCL, and in case of a changing market they could translate their programs from one to the other. For now, a requirement to have a high-end nVidia video card can be justified: a card which many people actually have, or could easily buy within the budget of their current project. The difference between Cuda and OpenCL is comparable to that between C# and Java – the corporate tactics are also the same. Possibly nVidia will have better driver support for Cuda than for OpenCL, and since Cuda does not work on AMD cards, the conclusion will be that Cuda is faster. We could then end up in a situation where AMD and Intel have to buy Cuda patents, since OpenCL lacks the support.

We hope that OpenCL will remain the mainstream GPGPU API, so that the battle will be fought on hardware, drivers and support for higher-level programming languages. We really appreciate what nVidia has already done for the GPGPU industry, but we hope they will eventually embrace OpenCL solely, for the sake of long-term market development.

What we left out of this discussion is Microsoft’s DirectCompute. It will be used by game developers on the Windows platform, who just need physics calculations. When discussing the game industry, we will tell more about Microsoft’s DirectX extension.
Also come back to read our upcoming article about FPGAs + OpenCL, a combination from which we expect a lot.

Our Onboarding Process

We currently take 6 months for onboarding. This helps you, as our new colleague, get used to the different environment, the company’s priorities and our ways of working.

What requirements we signal before hiring

The self-assessment contains all the elements we try to foster. It shows 4 main areas:

  • CPU programming & algorithms
  • GPU programming & performance oriented programming
  • Problem solving
  • Collaboration, planning & prioritization

We discuss, improve and measure each of these, plus some personal skills. Six months is never enough to improve on everything, but it is enough to focus on the items where most needs to be done. It’s certainly enough time to understand what quality means and how to get there.

It’s easy to get confused about quality. Quality might mean deluxness, fanciness, expensiveness. That’s not what quality means. Quality means meeting spec. Keeping the promise. Doing exactly what you said you would do. It is measured in consistency.

Seth Godin

First week(s)

Getting started means you get everything on your desk at once. Best to have a checklist:

  • Creation of accounts
  • Ask questions you might have to your buddy.
  • Introduction to others.
  • Learning the basics of the development and collaboration environments
  • Get necessary hardware, including noise-canceling headphones.
  • Make technical contributions from the first day, as this provides:
    • Measurable contributions
    • Time management

What to expect in your first months

In your first weeks, you’ll ease into a project, pair up with a buddy, meet teammates (both in-office and remote), and get familiar with how we work: daily standups, time tracking, and internal tools like GitLab. You’ll also learn about our team culture, performance expectations, and the key documents that shape how we collaborate. Regular check-ins at one and five months help track your growth, gather feedback, and support your long-term development.

Memberships

We are active in several foundations, communities and collaborations. Below is an overview.

Khronos


Associate member of Khronos, the non-profit technology consortium that maintains important open standards like OpenCL, OpenGL, SPIR and Vulkan.

The Khronos Group was founded in 2000 to provide a structure for key industry players to cooperate in the creation of open standards that deliver on the promise of cross-platform technology. Today, Khronos is a not for profit, member-funded consortium dedicated to the creation of royalty-free open standards for graphics, parallel computing, vision processing, and dynamic media on a wide variety of platforms from the desktop to embedded and safety critical devices.

High Tech NL

High Tech NL is the sector organization by and for innovative Dutch high-tech companies and knowledge institutes. High Tech NL is committed to the collective interests of the sector, with a focus on long-term innovation and international collaboration.

We’re a member because HighTech NL is one of the few organizations that understands IT is far more than digitisation. Our main focus there is robotics.

HSA Foundation


The Heterogeneous System Architecture (HSA) Foundation is a not-for-profit industry standards body focused on making it dramatically easier to program heterogeneous computing devices. The consortium comprises various semiconductor companies, tools providers, software vendors, IP providers and academic institutions that develop royalty-free standards and open-source software.

HiPEAC

HiPEAC’s mission is to steer and increase European research in the area of high-performance and embedded computing systems, and to stimulate (international) collaboration.

We’ve sponsored multiple conferences over the years.

ETP4HPC

ETP4HPC is the European Technology Platform (ETP) in the area of High-Performance Computing (HPC). It is an industry-led think-tank comprising European HPC technology stakeholders: technology vendors, research centres and end-users. The main objective of ETP4HPC is to define research priorities and action plans in the area of HPC technology provision (i.e. the provision of supercomputing systems).

OpenPower

The OpenPOWER Foundation is an open, not-for-profit technical membership group incorporated in December 2013. It was founded to enable today’s data centers to rethink their approach to technology. OpenPOWER was created to develop a broad ecosystem of members that will create innovative and winning solutions based on the POWER architecture.

OpenCL basics: Multiple OpenCL devices with the ICD.

Xeon Phi, Tesla, FirePro

Most systems nowadays have more than just one OpenCL device and often from different vendors. How can they all coexist from a programming standpoint? How do they interact?

OpenCL platforms and OpenCL devices

Firstly, please bear with me for a few words about OpenCL devices and OpenCL platforms.

An OpenCL platform usually corresponds to a vendor, which is responsible for providing the OpenCL implementation for its devices. For instance, a machine with an Intel i7-4790 CPU will have one OpenCL platform, probably named “Intel OpenCL”, and this platform will include two OpenCL devices: one is the Intel CPU itself and the other is the Intel HD Graphics 4600 GPU. The Intel OpenCL platform provides the OpenCL implementation for both devices and is responsible for managing them.

Let’s have another example, but this time from outside the Windows ecosystem. A MacBook running OS X and having both the Intel Iris Pro GPU and a dedicated GeForce card will show one single OpenCL platform called “Apple”. The two GPUs and the CPU will appear as devices belonging to this platform. That’s because the “Apple” platform is the one providing the OpenCL implementation for all three devices.

Last but not least, keep in mind that:

  • An OpenCL platform can have one or several devices.
  • The same device can have one or several OpenCL implementations from different vendors. In other words, an OpenCL device can belong to more than just one platform.
  • The OpenCL version of the platform is not necessarily the same as the OpenCL version of the device.

The OpenCL ICD

ICD stands for Installable Client Driver, and it refers to a model that allows several OpenCL platforms to coexist. It is actually not core functionality, but an extension to OpenCL.

  • For Windows and Linux the ICD has been available since OpenCL 1.0.
  • OS X doesn’t have an ICD at all. Apple chose to provide all the drivers itself under a single platform.
  • Android did not have the extension under OpenCL 1.1, but people ported its functionality. With OpenCL 2.0 the ICD is also available on Android.

How does this model work?

The OpenCL ICD on Windows

While a machine can have several OpenCL platforms, each with its own driver and OpenCL version, there is always just one ICD Loader. The ICD Loader acts as a supervisor for all installed OpenCL platforms and provides a unique entry point for all OpenCL calls. Based on the platform ID, it dispatches each OpenCL host call to the right driver.

This way you link against the ICD loader (OpenCL.dll on Windows or libOpenCL.so on Linux) instead of directly against all the possible drivers. At run-time, an OpenCL application will search for the ICD loader and load it. The loader in turn looks in the registry (Windows) or a special directory (Linux) to find the registered OpenCL drivers. Each OpenCL call from your software will be resolved by the ICD, which dispatches it to the selected OpenCL platform.

A few things to keep in mind

The ICD gets installed on your system together with the drivers of the OpenCL devices. Hence, a driver update can also result in an update of the ICD itself. To avoid problems, an OS can decide to handle the OpenCL installation itself.

Please note that the ICD, the platform and the OpenCL library the application is linked against do not necessarily correspond to the same OpenCL version.

I hope this explains how the ICD works. If you have any questions or suggestions, just leave a comment. Also check out the Khronos page for the ICD extension. And if you need the sources to build your own ICD (with a license that allows you to distribute it with your software), check the OpenCL registry at Khronos.

Is the CPU slowly turning into a GPU?

It’s all in the plan?

Years ago I was surprised by the fact that CPUs were also programmable with OpenCL – I had chosen that language solely for the coolness of being able to program GPUs. It was weird at the start, but now I cannot think of a world without OpenCL running on CPUs.

But why is it important? Who cares about the four cores of a modern CPU? Let me first go into why CPUs were stuck at two cores for so long. Simply put, it was very hard to write multi-threaded software that made use of all cores. Software like games did, as they needed all the available resources, but even the computations in MS Excel are mostly single-threaded as of now. Multi-threading was perhaps used most for keeping the user interface non-blocking. Even though OpenMP was standardised 15 years ago, it took many years before the multi-threaded paradigm was used for performance. If you want to read more on this, search the web for “the CPU frequency wall”.

More interesting is what is happening with CPUs now. Both Intel and AMD are releasing CPUs with lots of cores. Intel recently launched an 18-core processor (the Xeon E5-2699 v3) and AMD has been offering 16-core CPUs for a longer time (the Opteron 6300 series). Both have SSE and AVX, which adds extra parallelism. If you don’t know what this is precisely about, read my 2011 article on how OpenCL uses SSE and AVX on the CPU.

AVX3.2

Intel now steps forward with AVX3.2 on its Skylake CPUs, while AVX3.1 is in the Xeon Phi “Knights Landing” – see this rumoured roadmap.

It is 512 bits wide, which means that 16 floats (or 8 doubles) fit in a single vector register. With 16 cores, this would mean 256 float operations per clock-tick. Like a GPU.

The disadvantage is similar to the VLIW architecture we had in the pre-GCN generations of AMD GPUs: one needs to fill the vector instructions to get the speed-up. The relatively slow DDR3 memory is also an issue, but lots of progress is being made there with DDR4 and stacked memory.


So is the CPU turning into a GPU?

I’d say yes.

With AVX3.2 the CPU gets all the characteristics of a GPU, except the graphics pipeline. That means the CPU-part of the CPU-GPU acts more like a GPU. The funny part is that, with its scalar architecture and more complex schedulers, the GPU is slowly turning into a CPU.

In this 2012 article I discussed the marriage between the CPU and GPU. This merger will continue in many ways – a frontier where the HSA Foundation is doing great work now. So from that perspective, the CPU is transforming into a CPU-GPU; and we’ll keep calling it a CPU.

This all strengthens my belief in the future of OpenCL, as that language is prepared for both task-parallel and data-parallel programs – for both CPUs and GPUs, to say it in current terminology.