MS just did not port Windows 8 to ARM

With a lot of fanfare Microsoft said they would offer Windows 8 in both an X86 as an ARM version. I was happy to see that Microsoft was innovating again after 10 years, and even saw loads of advantages of their Java-clone .NET. But then I started to read into Windows CE, Windows Embedded Compact, Windows Embedded Standard, Windows Mobile and Windows 8 (Desktop). I want to share this with you, even if it has not anything to do with OpenCL.

So they did port Windows to ARM which evolved to the 2012 version of the OS, but did not port Windows 8.0 from scratch. Below you can read why.

Continue reading “MS just did not port Windows 8 to ARM”

“Soon we will use only one thousandth of available computer capacity”

Professor Henri Bal
Professor Henri Bal, who tries to wake up the Netherlands to start going big on parallel programming

At StreamHPC we mostly work for companies in the bigger countries of Europe and North America. We hardly work for companies in the Netherlands. But it seems that after 5 years of sleeping, there is some shaking. Below is a (translated) article with the above quote by Prof. Dr. Ir. Henri Bal, professor at the Computer section at the Vrije University of Amsterdam.

Lack of knowledge of parallel programming will cause a situation where only one thousandth of the capacity of computers will be used. This makes computations unnecessarily slow and inaccurate. That in turn will slow down the development of the Dutch knowledge economy.

Sequential programming, instructing computers to perform calculations in a queue, is now the standard. Computers processors, however, are much more sophisticated and able to perform thousands or even millions of computations simultaneously. But the programming of such many-cores “is still in its infancy, industries that rely heavily on data, can not perform optimally”, claims Ball.

The value of parallel programming, according to Ball, is of enormous importance, for example, meteorology and forensics. “For weather forecasting data from the dense network of computers need to be quickly and accurately processed to have a weather forecast for tomorrow, not after 48 hours,” he says. “In forensics all data should be explored in the first 24 hours after a crime as soon as possible and through pattern recognition all data, for no trace to be lost. The video material of 80,000 security cameras which was manually searched through after the attack on the London Underground in 2005 – with parallel computing methods this can now rapidly be executed by the computer.”

If the Netherlands wants to widen the gap investments are necessary, says Bal. The focus should be on research and teaching. “Investments in research on programming new massively-parallel machines are required to gain knowledge. Thus it must be examined how programs should be written for parallel computing methods and what extent of parallel calculations can be performed automatically. In teaching our future programmers need also to be prepared for the new standards of parallel programming. Only then the Netherlands can make optimal use of the available computer capacity. “

I think my fellow countrymen will be surprised they can find help just around the corner. And if they wait two more years, then 1000x speed-up from sequential programs are indeed becoming possible.

Have you seen similar articles that sequential programming is slowing the knowledge economy?

All the members of the OpenCL working group 2013

In the below list are the members of the OpenCL workgroup as of November 2013.

OCL-WG-members-Nov12

We can expect small changes each year, but this is close to the actual state. I need the rest of Q4 to finalise all the info – any help is appreciated.

This list has also been compiled in 2010, and you can see several differences. If the company has an SDK available, there is a link. That is a whole difference with the last list – this one is much more concrete. Continue reading “All the members of the OpenCL working group 2013”

Performance can be measured as Throughput, Latency or Processor Utilisation

40225151 - fiber optic cable
Getting data from one point to another can be measured in throughput and latency.

When you ask how fast code is, then we might not be able to answer that question. It depends on the data and the metric.

In this article I’ll give an overview of different ways to describe speed and what metrics are used. I focus on two types of data-utilisations:

  • Transfers. Data-movements through cables, interconnects, etc.
  • Processors. Data-processing. with data in and data out.

Both are important to select the right hardware. When we help our customers select the best hardware for their software,an important part of the advice is based on it.

Transfer utilisation: Throughput

How many bytes gets processed per second, minute or hour? Often a metric of GB/s is used, but even MB/day is possible. Alternatively items per second is used, when relative speed is discussed. An alternative word is bandwidth, which described the theoretical maximum instead of the actual bytes being transported.

The typical type of software is a batch-process – think media-processing (audio, video, images), search-jobs and neural networks.

It could be that all answers are computed at the end of the batch-process, or that results are given continuously. The throughput is the same, but the so called latency is very different.

Transfer utilisation: Latency

What is the time between the data-offering and the results? Or what is the reaction time? It is measured in time (often nanoseconds (ns, a billionth of a second), microsecond (μs, a millionth of a second) or milliseconds (ms, a thousandth of a second). When latency gets longer than seconds, its still called latency but more often it’s called “processing time”

This is important in streaming applications – think of applications in broadcasting and networking.

There are three causes for latency:

  1. Reaction time: hardware/software noticing there is a job
  2. Transport time: it takes time to copy data, especially when we talk GBs
  3. Process time: computing the data can

When latency is most important we use FPGAs (see this short presentation on OpenCL-on-FPGAs) or CPUs with embedded GPUs (where the total latency between context-switching from and to the GPU is a lot lower than when discrete GPUs are used).

Processor utilisation: Throughput

Given the current algorithm, how much potential is left on the given hardware?

The algorithm running on the processor possibly is the bottleneck of the system. The metric we use for this balance is “”FLOPS per byte”. This means that the less data is needed per compute operation, the higher the chance that the algorithm is compute-limited. FYI: unless your algorithm is very inefficient, you should be very happy when you’re compute-limited.

resizedimage600300-rooflineai (1)

The below image shows how the above algorithms on the roofline-model. You see that for many processors you need to have at least 4 FLOPS per byte to hit the frequency-wall, else you’ll hit the bandwidth-wall.

roofline

This is why HBM is so important.

Processors utilisation: Latency

How fast can data get in and out of the processor? This sets the minimum latency that can be reached. The metric is the same as for transfers (time), but then on system level.

For FPGAs this latency can be very low (10s of nanoseconds) when data-cables are directly connected to the FPGA-chip. Such FPGAs are on a board with i.e. a network-port and/or a DisplayPort-port.

GPUs depend on how well they’re connected to the CPU. As this is a subject on its own, I’ll discuss in another post.

Determining the theoretical speed of a system

A request “Make this hardware as fast as possible” is a lot easier (and cheaper) to solve than “Make this hardware as fast as possible on hardware X”. This is because there is no one fastest hardware (even though vendors make believe us so), there is only hardware most optimal for a specific algorithm.

When doing code-reviews, we offer free advice on which hardware is best for the target algorithm, for the given budget and required power-envelope. Contact us today to access our knowledge.

Promotion for OpenCL Training (’12 Q4 – ’13 Q2)

So you want your software to be much faster than the competition?

In 4 days your software team learns all techniques to make extremely fast software.

Your team will learn how to write optimal code for GPUs and make better use of the existing hardware. They will be able to write faster code immediately after the training – doubling the speed is minimal, 100 times is possible. Your customers will notice the difference in speed.

We use advanced, popular techniques like OpenCL and older techniques like cache-flow optimisation. At the end of the training you’ll receive a certificate from StreamHPC.

Want more information? Contact us.

About the training

Location and Time

OpenCL is a rather new subject and hard-coding the location and time has not proved to be successful in the past years for trainers in this subject. Therefore we chose for flexible dates and initially offer the training in large/capital cities and technology centres world-wide.

A final date for a city will be picked once there are 5 to 8 attendees, with a maximum of 12. You can specify your preferences for cities and dates in the form below.

Some discounts are available for developing countries.

Agenda

Day 1: Introduction

Learn about GPU architectures and AVX/SSE, how to program them and why it is faster.

  • Introduction to parallel programming and GPU-programming
  • An overview of parallel architectures
  • The OpenCL model: host-programming and kernel-programming
  • Comparison with NVIDIA’s CUDA and Intel’s Array Building Blocks.
  • Data-parallel and task-parallel programming
Lab-session will be an image-filter.
Note: since CUDA is very similar to OpenCL, you are free to choose to do the lab-sessions in CUDA.

Day 2: Tools and advanced subjects

Learn about parallel-programming tactics, host-programming (transferring data), IDEs and tools.

  • Static kernel analysis
  • Profiling
  • Debugging
  • Data handling and preparation
  • Theoretical backgrounds for faster code
  • Cache flow optimisation
Lab-session: yesterday’s image-filters using a video-stream from a web-cam or file.

Day 3: Optimisation of memory and group-sizes

Learn the concept of “data-transport is expensive, computations are cheap”.
  • Register usage
  • Data-rearrangement
  • Local and private memory
  • Image/texture memory
  • Bank-conflicts
  • Coalescence
  • Prefetching
Lab-session: various small puzzles, which can be solved using the explained techniques.

Day 4: Optimisation of algorithms

Learn techniques to help the compiler make better and faster code.
  • Precision tinkering
  • Vectorisation
  • Manual loop-unrolling
  • Unbranching
Lab-session: like day 3, but now with compute-oriented problems.

Enrolment

When filling in this form, you declare that you intend to follow the course. Cancellation can be done via e-mail or phone at any time.

StreamHPC will keep you up-to-date for the training at your location(s). When the minimum of 5 attendees has been reached, a final date will be discussed. If you selected more locations, you have the option to wait for a training at another city.

Put any remarks you have in the message. If you have any question, mail to trainings@streamhpc.com.

[si-contact-form form=’7′]

Supporting OpenCL on your own hardware

Say you have a device which is extremely good in numerical trigoniometrics (including integrals, transformations, etc to support mainly Fourier transforms) by using massive parallelism. You also have an optimised library which takes care of the transfer to the device and the handling of trigoniometric math.

Then you find out that the strength of your company is not the device alone, but also the powerful and easy-to-use library. You also find out that companies are willing to pay for the library, if it would work with other devices too. From your own helpdesk you hear that most questions are about extending the library with specialised functions. Giving this information, you define new customer groups for device-only and library-only – so just by adopting a standard you can increase revenue. Read below which steps you have to take to adopt OpenCL.

Continue reading “Supporting OpenCL on your own hardware”

Virtual Meeting room

If you have an appointment with us, choose the meeting room you have, enter your name and provide your password. You need a modern browser (IE, Edge, Firefox,  Chrome, etc. Not Safari) or have flash installed.

The room linked to this resource is not configured correctly.

The room linked to this resource is not configured correctly.

Waiting for Mobile OpenCL

People who follow me, know my interest in ARM Cortex CPU & Mali GPU and Imagination Technology’s PowerVR, regarding OpenCL-potential. Here is an overview of what I found out until now and has more open ends than answers. I was very hopeful about access to mobile OpenCL for developers, but now I think it will have to wait until 2011. It seems that the mobile phone makers keep the power to themselves in the form of system-libraries instead of giving direct access. Actually a wise choice, given the fact that a bad kernel could easily crash the phone.

Imagination Technology’s PowerVR SGX product provides the GPU-power for the iPhone 4 and various new Android phones. The company does not provide an OpenCL SDK directly to end-users, but leaves the responsibility to the implementers of their IP. Strangely enough they offer SDKs for OpenGL, and say “no comment” to a pretty direct forum-question. In other words: we don’t know what to expect.

Samsung has the only official implementation for ARM (ARM Cortex-A9 MPCore CPU) for OpenCL. Samsung just released their home-made Linux-based mobile phone OS “Bada”, but there is not a sign of OpenCL in their API. Samsung being the only one who could open an API to developers on their phones officially, does not make me smile yet.

There are many, many voices that Apple iOS 4 has support for OpenCL, but then only in the system-libraries. Given the fact that Apple is a big fan of OpenCL, we can assume this is true. See the web for a lot of articles about this.

ZiiLabs is open about their support for OpenCL in their ZMS-05 processor, and has an early access program. Early access still means waiting.

Qualcomm’s Snapdragon does not mention OpenCL at all, while it powers the more powerful smartphones. It mentions a recent job-post, so you know the drill: wait.

Android has good support for OpenGL ES, but not officially for OpenCL.You might have heard of the particle experiment for Android, which is actually OpenGL-based. It also mentions the Android-version of the bullet engine (broken link) for physics-computations, but also no OpenCL. It doesn’t look like Android “Gingerbread” will have support.

So what did you learn after reading this? That developers still won’t have access to OpenCL-API on a mobile platform, and that you have to wait until next year or enter ZiiLabs’ early access program. More about the good and bad of hiding OpenCL on this blog.

Self-Assessment GPGPU-role

As we’re not a university but a company, there needs to be a balance between things you can offer and the things we offer. Like in every job description, there is a list of bullet points to explain what we seek. To make it possible to self-asses your fitness for the job, we’ve put the number of points (✪) for each bullet point.

INSTRUCTIONS. For each section, assess yourself as being a:

  1. beginner: have been in contact with it briefly
  2. junior: had some experience, but not difficult problems
  3. medior: had more experience, but cannot coach others yet
  4. senior: experienced enough to coach others to really advance on this subject
  5. lead: can teach new things to a senior
  6. principal/master/guru: one of the world’s best

You need to look for the level where you get the most points. For example, if you are a master in C++ but are a medior in C and math, it might actually be best to assess as a medior and mention your C++ knowledge specifically. Or for example, if you are sure you can successfully finish a tutorial on GPGPU, you’re a beginner.

Real question to answer before applying: do you want to become a senior in GPGPU?

If you have the imposter syndrome, don’t be too harsh on yourself. If you’re overconfident, be realistic. If you worry you’re both, pick imposter syndrome only.

Heads up. During the interviews, we ask questions for the above self-assessment. If you assess yourself as a senior for GPU-coding because you were the best-of-class, you’ll get questions. And there is no person who can be defined by lists, so do mention where you stand out.

You, as a CPU Developer (9 items, 18 weights)

We seek people with experience. This can be open source projects for first jobseekers, or past jobs for those with job experience.

  1. You are capable of designing mathematical algorithms, both serial and parallel. ✪
  2. You are strong in math and hard sciences. ✪✪
  3. You know how compilers work, and you are unfortunate enough to know when they don’t. ✪✪
  4. You are experienced in C. ✪✪
  5. You are experienced in C++. ✪✪
  6. You have experience designing performance-driven architectures. ✪✪
  7. You know how to write tests. ✪✪✪
  8. You are experienced with continuous development. ✪✪
  9. You have experience with low-level optimizations. ✪✪

You, as a GPU Developer (6 items, 15 weights)

We seek people with experience. This can be open source projects for first jobseekers, or past jobs for those with job experience.

  1. You have read “hardware architecture specification documents” or ISA-docs. ✪
  2. You know how GPU-compilers work, and you are unfortunate enough to know when they don’t. ✪✪
  3. You are experienced in CUDA and/or HIP. ✪✪✪✪
  4. You are experienced in OpenCL and/or SYCL. ✪✪✪✪
  5. You know your way around with GPU-libraries. ✪
  6. You are experienced in porting algorithms to the GPUs, without the use of any library. ✪✪✪

As the four stars indicate, we do need minimal GPGPU-experience, as you won’t learn it here.

You, as a Problem Solver (8 items, 21 weights)

Coding is only one part of the solution. Most of the time we’re solving problems, where coding is just the means.

  1. You like the ideas and theories around the “learning mindset”. ✪✪✪
  2. You have a structured problem-solving approach that you could explain. ✪✪✪
  3. You have high self-awareness and can self-observe. ✪✪✪✪
  4. You have high standards for yourself. ✪✪
  5. You test out approaches by making quick experiments. ✪✪
  6. You test out possible solutions by mentally putting them in different scenarios. ✪
  7. You regularly take time to zoom out to get an overview on the problem, to be able to balance the inputs for the solution. ✪✪✪✪
  8. You always follow through. ✪✪

If you score high here, this will compensate for any lack of technical experience. Also for continuous growth, you’ll need to score high here.

You, as a Project Team Member (10 items, 20 weights)

Our company’s strength is that we work in teams. We don’t know everything as individuals, but as a team we can solve almost any problem around HPC and GPUs. This means we highly value collaboration and thus must be efficient in project handling.

  1. You have a proven track record of being focused on results. ✪
  2. You have a talent for turning vague problems into the right actions, and you want to build on it. ✪
  3. You normally write down tasks, and then prioritize & ESTIMATE them. ✪✪✪
  4. You understand that well-defined, well-communicated delivery criteria are the responsibility of every team member. ✪✪✪
  5. You can identify something’s missing to move a project forward smoothly. ✪✪✪
  6. You speak up when the project diverges from the trajectory. ✪✪
  7. You are used to administrating your time spent on an issue. ✪
  8. You can delegate work. ✪✪
  9. You can get work delegated. ✪✪
  10. You can explain, with examples, why the above are important. ✪✪

We explicitly did not state “project management”. It is about playing your part of making an efficient team.

What’s next?

If you got at least junior on CPU, Team, and problem-solving, beginner on GPU, and at least one of the 4 on medior? Then you should apply. Go to https://streamhpc.com/jobs/ for the instructions and links to other articles that should help you with understanding if this is a job for you.

Understand that if you are a true beginner in GPGPU, it’s best to follow the tips&tricks explained here.

We’re looking for an intern to do the cool stuff: benchmarking and Linux wizarding

intern
So, don’t let us retype your documents and blog posts, as that would make us your intern.

We have some embedded devices here, which badly need attention. Some have gotten some private time on the bench, but we did not share anything on the blog yet with our readers. We simply need some extra hands to do this. Because it’s actually cool to do, but admittedly a bit boring when doing several devices, it was the perfect job for an intern. Besides the benchmarking, we have some other Linux-related projects for you. You’ll get an average payment for an internship in the Netherlands (in Dutch: “stagevergoeding”), lunch, a desk and a bunch of devices (aka toys-for-techies).

Like more companies in the Netherlands, we don’t care about how you where born, but who you are as a person. We expect from you that you…

  • know everything about Linux administration, from servers to embedded devices.
  • know how to setup a benchmark.
  • document all what you do, not only the results.
  • speak and write Dutch and English.
  • have great humor! (Even if you’re the only one who laughs at your jokes).
  • study in the EU, or can arrange the paperwork to get to the EU yourself.
  • have a place to live/crash in or nearby Amsterdam, or don’t mind the daily travelling. You cannot sleep in the office.

Together with your educational institute we’ll discuss the exact learning goals of the internship, and make a plan for a period of 3 to 6 months.

If you are interested, send a mail to jobs@streamhpc.com. If you know somebody who would be interested, please tell that person that we’re waiting for him/her! Also tips&tricks on finding the right person are very welcome.

Targetting various architectures in OpenCL and CUDA

“Everything that *is* makes up one single world; but not everything is alike in this world” – Plato

The question we aim to answer in this post is: “How to do you make software that performs on several platforms?”.

Note: This article is not fully finished – I’ll add more information during the coming months. It’s busy here!

Even in many Java-code you’ll find hard-coded filename-delimiters in the file-names, which then work on one OS only. Portability is a problem that exists in various aspects of programming. Let’s look at some of the main goals software can have, and which portability-problems they have.

  • Functionality. This is the minimum requirement. Once a function is decided, changing functionality takes a lot of time. Writing code that is very flexible in requirements is hard.
  • User-interface. This is what one sees and which is not too abstract to talk about. For example, porting software to a touch-device requires a lot of rethinking of interaction-principles.
  • API and library usage. To lower development-time, existing and known APIs and libraries are used. This can work out three ways: separation of concerns, less development-time and dependency. The first two being good architectural choices, the latter being a potential hazard. Changing the underlying APIs is not easy.
  • Data-types. Handling video is different from handling video-formats. If the files can be handles in the intermediate form used by the software, then adding new file-types is relatively easy.
  • OS and platform. Besides many visible specifics, an OS is also a collection of APIs. Not only corporate operating systems tend to think of their own platform only, but also competing standards. It compares a lot to what is described under APIs.
  • Hardware-performance. Optimizing software for a specific platform makes it harder to port to other platforms. This will the main point of this article.

OpenCL is known for not being performance-portable, but it is the best we currently have when it comes to writing code with performance as a primary target. The funny thing is that with CUDA 5.0 it has become clearer that NVIDIA has the problem in their GPGPU-language too, whereas it was used before to differentiate CUDA from OpenCL. Also, CUDA 5.0 has many new features only available on the latest Kepler-GPUs.

Continue reading “Targetting various architectures in OpenCL and CUDA”

Commodity and Open Standards – why OpenCL matters

V-UVThis article actually discusses the question: is GPGPU a solution for the masses, or is it for niche-products? For the latter open standards matter a lot less, as you will read.

If you watch the below video on sale&marketing by Victor Antonio, then you get what is so difficult about open standards: It pushes all companies using the standard into a focus on becoming the best. Indeed, survival of the fittest may be the base of (true) capitalism and giving the best products. Problem is that competition on price is not safe for the future of the company.

http://www.youtube.com/watch?v=SJ5QmW3LfN4

The key is specialisation, or creating unique value. The below video discusses this. The difference between “a feature” and “unique value” is a discussion on its own, you really should have with your team on your own products. Continue reading “Commodity and Open Standards – why OpenCL matters”

Why use OpenCL on FPGAs?

9781118942208.pdfAltera has just released the free ebook FPGAs for dummies. One part of the book is devoted to OpenCL, so we’ll quote some extracts here  from one of the chapters. The rest of the book is worth a read, so if you want to check the rest of the text, just fill in the form on Altera’s webpage.

In StreamHPC we’re interested in OpenCL on FPGAs for one reason: many companies run their software on GPUs, when they should be using FPGAs instead; and at the same time, others stick to FPGAs and ignore GPUs completely. The main reason, we think, is that converting CUDA to VHDL, or Verilog to CPU intrinsics, is simply too painful. Another reason can be seen in the a amount of investment put on a certain technology. We believe that OpenCL can solve both of these issues. OpenCL is much more portable and can be converted to a new architecture in a relatively short time (if the developer is familiar with the project, the hardware and OpenCL). We have high familiarity with these two latter, which means we’re used to get new projects up-and-running.

Since both Altera and Xilinx have invested in OpenCL, the two FPGAs code has become more portable now. Altera has a public SDK (and they’re proudly loud about it), while Xilinx offers it in their latest tools (although they’re unfortunately much more silent about it).

Now, let us now go back to the quotes from the book that we wanted to share with you.

Andrew Moore describes OpenCL effectively in just a few sentences:

The need for heterogeneous computing is leading to new programming languages to exploit the new hardware. One example is the OpenCL first developed by Apple, Inc. OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, DSPs, FPGAs, and other types of processors. OpenCL includes a language for developing kernels (functions that execute on hardware devices) as well as application programming interfaces (APIs) that define and control the various platforms. OpenCL allows for parallel computing using task-based and data-based parallelism.

The author also shares some interesting insights around the reasons why OpenCL should be used on FPGA:

FPGAs are inherently parallel, so they’re a perfect fit with OpenCL’s parallel computing capabilities. FPGAs give you an alternative to the typical data or task parallelism by offering a pipeline parallelism where tasks can be spawned in a push-pull configuration with each task using different data from the previous task with or without host interaction. OpenCL allows you to develop your code in the familiar C programming language but using the additional capabilities provided by OpenCL. These kernels can be sent to the FPGAs without your having to learn the low-level HDL coding practices of FPGA designers. Generally, there are several benefits for software developers and system designers to use OpenCL to develop code for FPGAs:

  • Simplicity and ease of development: Most software developers are familiar with the C programming language, but not low-level HDL languages. OpenCL keeps you at a higher level of programming, making your system open to more software developers.
  • Code profiling: Using OpenCL, you can profile your code and determine the performance-sensitive pieces that could be hardware accelerated as kernels in an FPGA.
  • Performance: Performance per watt is the ultimate goal of system design. Using an FPGA, you’re balancing high performance in an energy-efficient solution.
  • Efficiency: The FPGA has a fine-grain parallelism architecture, and by using OpenCL you can generate only the logic you need to deliver one fifth of the power of the hardware alternatives.
  • Heterogeneous systems: With OpenCL, you can develop kernels that target FPGAs, CPUs, GPUs, and DSPs seamlessly to give you a truly heterogeneous system design.
  • Code reuse: The holy grail of software development is achieving code reuse. Code reuse is often an elusive goal for software developers and system designers. OpenCL kernels allow for portable code that you can target for different families and generations of FPGAs from one project to the next, extending the life of your code.

Today, OpenCL is developed and maintained by the technology consortium Khronos Group. Most FPGA manufacturers provide Software Development Kits (SDKs) for OpenCL development on FPGAs.

You can continue here if you want to read of this ebook. And  of course, whenever you want to learn some more more, feel free to write to us, or follow this conversation on Twitter, which goes on through our special account: @OpenCLonFPGAs.

X86-workstation buying guide for OpenCL developers, Q1 2013

Curved iMac has your back…
Nuno Teixeira designed a large curved monitor in 2008 and assumed it would never be made. For a “few” thousand dollar NEC offers one to you right now. Also Samsung and LG have announced several new curved TVs at CES 2013 (with hdmi-port). We only need a workstation to go with it, where this blog-article might come in handy.

Important: this article was written before Intel “Haswell” and AMD “Richland” architectures came out.

So you want to start developing for OpenCL? When you focus on developing OpenCL for X86, you have these three options: CPUs, GPUs and CPUs with and embedded GPU. This article is for you and represents the current state of hardware – if you want the best hardware for your specific algorithm, the below information is probably not sufficient.

In 2013 we focus on 3 groups: servers/cloud (FirePro, Tesla, XeonPhi), workstations (discussed here), low-power devices (SoCs) and special accelerators (FPGAs and DSPs). This article does not discuss high-end accelerators of a few thousands of Euro, which are laid out in here.

Before reading on, you need to set the goal for your workstation.

  • If you want to learn the basics of OpenCL-programming, first check if your current machine has OpenCL-support.
  • If you need more processing power, be sure you select the right hardware for the job. Don’t buy the most expensive hardware (FirePro, Tesla or XeonPhi), but take your time to find out which hardware supports your algorithms best. Feel free to ask us.
  • If you want to make sure your software works on various types of accelerators, you can choose between:
    • swapping PCIe-cards – disadvantage is the drivers-hazzle and time-consumption.
    • more accelerators in one machine – disadvantage is that only GPU 1 can do OpenGL/DirectX.
    • identical machines with different accelerators – disadvantage is the price.
  • If you want to focus on multi-GPU development, you need:
    • or enough power-supply and the motherboard supports many lanes,
    • or buy a videocard with two GPUs.

This article has the goal to help you with buying a good machine for OpenCL-development. Prices are of January 2013. If you think I make the wrong suggestions, please give feedback via the comments.

My contacts at various companies can tell: I want to stay independent no matter what. No deals have been made nor was there any outside influence, except the friendly people of the local computer shops. I was surprised I ended up with suggestion so much AMD hardware, that I felt quite uncomfortable with it – I finally decided to keep to my first conclusions and leave the comments completely open.

Continue reading “X86-workstation buying guide for OpenCL developers, Q1 2013”

OpenCL – the battle, part I

Part I: the Hardware-companies and Operating Systems

(Part II will be about programming languages and software-companies, part III about the gaming-industry)

OpenCL is the new, but already de-facto standard of stream-computing; but how it got there so fast is somewhat strange. A few years ago there were many companies and research-groups seeing the power of using the GPU, such as:

And the fight is really not over, since we are talking about a big shift in the super-computing industry. Just think of IBM BlueGene, which will lose lots of market to nVidia and AMD. Or Intel, who hasn’t acquired a GPU-creator as AMD did. Who had expected the market to change this rigorous? If we’re honest, we could have seen it coming (when looking at the turbulence around PhysX and Havok), but “normally” this new techniques would be introduced slowly.

The fight is about market-shares. For operating-systems, the user wants to have their movies encoded in 20 minutes just like their neighbour. For HPC-computing, since clusters can be updated for a far lower price than was possible with the old-fashioned way; here it is mostly between Linux HPC and windows HPC (which still has a very small market-share), but also database-engines which rely on high-performance hardware/software.
The most to gain is in the processor-market. The extremely large consumer-market is declining since 2004, since most users do not need more than a netbook and have bought a separate gaming-computer for the more demanding games. We don’t only see Intel and AMD anymore, but IBM’s powerful Cell- en Power-processors, very power-efficient ARM-processors, etc. Now OpenCL could make it more interesting to buy an average processor and a good graphics-card, Intel (and AMD) have no choice then to take the battle with nVidia.

Background: Why Apple made OpenCL

Short answer: pure frustration. All those different implementations would or get a share or fight for being named the standard; Apple wanted to bet on the right horse and therefore took the lead in creating an open standard. Money would be made by updating software and selling more hardware. For that reason Apple’s close partners Intel and nVidia were easily motivated to help developing the standard. Currently Apple’s only (public) reasons for giving away such an expensive and specialised project is publicity and to be ahead of the competition. Since it will not be a core-business of Apple, it does not need to stay in lead, but which companies do?

Acquisitions, acquisition, acquisitions

No time to lose for the big companies, so they must get the knowledge in-house as soon as possible. Below are some examples.

  • Microsoft: Interactive Supercomputing (22-Sept-2009): made Star-P, software which allowed users to perform scientific, engineering or analytical computation on array or matrix-based data to use parallel architectures such as multi-core workstations, multi-processor systems, distributed memory clusters or utility/cloud-based environments. This is completely in the field of OpenCL, which Microsoft needs to strengthen its products as Apple already did, such as SQL-server and Windows HPC.
  • nVidia: Ageia technologies (22-Febr-2008): made specialized PC-cards and software for calculating complicated physics in games. They made the first commercial product aiming at the masses (gamers). PhysX-code could by integrated in nVidia-drivers to be used with modern nVidia-GPUs.
  • AMD: ATI (24-juli-2006): graphics chip specialist. Although the price was too high, it saved AMD from being bought out by Intel and even stay ahead (if they had kept running).
  • Intel: Havok (17-Sept-2007): builds games-tools, such as a physics-engine. After Ageia was captured, the only good company out there to buy; AMD was too late, which spent all its money on ATI. Wind River (4-June-2009): a company providing embedded systems, development tools for embedded systems, middleware, and other types of software. Also read this interesting article. Cilk (31-July-2009): offers parallel extensions that are tightly tied into a compiler. RapidMind (19-Aug-2009): created a high-level language Sh, which had an OpenCL-backend. Intel has a lead in CPU-compilers, which it wants to broaden to multi-core- and GPU-compilers. Intel discovered it was in the group of “old fashioned compiler-builders” and had lots to learn in a short time.

If you know more acquisitions of interest, please let us know.

Winners

Apple, Intel and NVidia are the winners for 2009 and 2010. They have currently the most knowledge in house and have their marketing-machine running. NVidia has the best insight for new markets.

Microsoft and Game-developers are second; they took the first train by joining the OpenCL-consortium and taking it very serious. At the end of 2010 Microsoft will be at Apple’s level of expertise, so we will see then who has the best novelties. The game-developers, of which most already have experience with physics-calculations, all had a second chance when they had misjudged the Physics-engines. More on gaming in part III.

AMD is currently actually a big loser, since it does not seem to take it all seriously enough. But AMD can afford to be late, since OpenCL makes it easy to switch. We hope the best for AMD, since it has the technology of both CPU and GPU, and many years of experience in both fields. More on the competition between marketing-monster nVidia and silent AMD will be discussed in a blog-item, next week.

Another possible loser is Linux, which has lots to lose on HPC-market; OpenBSD-based Apple and Windows HPC can actually win market-share now. Expect most from hardware-manufacturers Intel, AMD and nVidia to give code to the community, but also from universities who do lots of research on the ever-flexible Linux. At the end it all depends on OpenCL-adaptation of (Linux-specific) programming-languages, which will be discussed in part II.

ARM is a member of the OpenCL-group but does not seem to invest in it; they seem to target another growing market: the low-power mobile devices. We will write on OpenCL and the mobile market later and why ARM currently can be relaxed about OpenCL.

We hope you have more insights in this new market; please contact us for more specific information and feel free to give your comments. Please stay tuned for part II and III, which will be released the next few weeks.

PDFs of Monday 19 September

Already the fourth PDF-Monday. It takes quite some time, so I might keep it to 10 in the future – but till then enjoy! Not sure which to read? Pick the first one (for the rest there is not order).

Edit: and the last one, follow me on twitter to see the  PDFs I’m reading. Reason is that hardly anyone clicked on the links to the PDFs.

I would like if you let others know in the comments which PDF you liked a lot.

Adding Physics to Animated Characters with Oriented Particles (Matthias Müller and Nuttapong Chentanez). Discusses how to accelerate movements of pieces of cloth attached to the bodies. Not time to read? There are nice pictures.

John F. Peddy’s analysis on the GPU market.

Hardware/Software Co-Design. Simple Solution to the Matrix Multiplication Problem using CUDA.

CUDA Based Algorithms for Simulating Cardiac Excitation Waves in a Rabbit Ventricle. Bioinformatics.

Real-time implementation of Bayesian models for multimodal perception using CUDA.

GPU performance prediction using parametrized models (Master-thesis by Andreas Resios)

A Parallel Ray Tracing Architecture Suitable for Application-Specific Hardware and GPGPU Implementations (Alexandre S. Nery, Nadia Nedjah, Felipe M.G. Franca, Lech Jozwiak)

Rapid Geocoding of Satellite SAR Images with Refined RPC Model. An ESA-presentation by Lu Zhang, Timo Balz and Mingsheng Liao.

A Parallel Algorithm for Flight Route Planning with CUDA (Master-thesis by Seçkîn Sanci). About the travelling salesman problem and much more.

Color-based High-Speed Recognition of Prints on Extruded Materials. Product-presentation on how to OCR printed text on cables.

Supplementary File of Sequence Homology Search using Fine-Grained Cycle Sharing of Idle GPUs (Fumihiko Ino, Yuma Munekawa, and Kenichi Hagihara). They sped up the BOINC-system (Folding@Home). Bit vague what they want to tell, but maybe you find it interesting.

Parallel Position Weight Matrices Algorithms (Mathieu Giraud, Jean-Stéphane Varré). Bioinformatics, DNA.

GPU-based High Performance Wave Propagation Simulation of Ischemia in Anatomically Detailed Ventricle (Lei Zhang, Changqing Gai, Kuanquan Wang, Weigang Lu, Wangmeng Zuo). Computation in medicine. Ischemia is a restriction in blood supply, generally due to factors in the blood vessels, with resultant damage or dysfunction of tissue

Per-Face Texture Mapping for Realtime Rendering. A Siggraph2011 presentation by Disney and NVidia.

Introduction to Parallel Computing. The CUDA 101 by Victor Eijkhout of University of Texas.

Optimization on the Power Efficiency of GPU and Multicore Processing Element for SIMD Computing. Presentation on what you find out when putting the volt-meter directly on the GPU.

NUDA: Programming Graphics Processors with Extensible Languages. Presentation on NUDA to write less code for GPGPU.

Qt FRAMEWORK: An introduction to a cross platform application and user interface framework. Presentation on the Qt-platform – which has great #OpenCL-support.

Data Assimilation on future computer architectures. The problems projected for 2020.

Current Status of Standards for Augmented Reality (Christine Perey1, Timo Engelke and Carl Reed). not much to do with OpenCL, but tells an interesting purpose for it.

Parallel Computations of Vortex Core Structures in Superconductors (Master-thesis by Niclas E. Wennerdal).

Program the SAME Here and Over There: Data Parallel Programming Models and Intel Many Integrated Core Architecture. Presentation on how to program the Intel MIC.

Large-Scale Chemical Informatics on GPUs (Imran S. Haque, Vijay S. Pande). Book-chapter on the design and optimization of GPU implementations of two popular chemical similarity techniques: Gaussian shape overlay (GSO) and LINGO.

WebGL, WebCL and Beyond! A presentation by Neil Trevett of NVidia/Khronos.

Biomanycores, open-source parallel code for many-core bioinformatics (Mathieu Giraud, Stéphane Janot, Jean-Frédéric Berthelot, Charles Delte, Laetitia Jourdan , Dominique Lavenier , Hélène Touzet, Jean-Stéphane Varré). A short description on the project http://www.biomanycores.org.

Big announcements: SYCL 1.2, WebCL 1.0 and OpenCL 2.0

opencl20Khronos just announced three OpenCL based releases:

  •  SYCL 1.2 Provisional Spec – Abstraction Layer for Leveraging C++ and OpenCL
  • WebCL 1.0 Final Spec – JavaScript bindings to OpenCL
  • OpenCL 2.0 Adopters Program – Conformance for OpenCL 2.0 implementations

Below I’ve quoted the summaries. For each of these I’ve prepared articles, but due to lack of time haven’t been able to finish and publish them. So for now some remarks after the summaries.

Khronos Releases SYCL 1.2 Provisional Specification

Programming abstraction layer to enable applications and high-level frameworks to leverage C++ and OpenCL for heterogeneous parallel acceleration

March 19, 2014 – San Francisco, Game Developer’s Conference – The Khronos™ Group today announced the release of SYCL™ 1.2 as a provisional specification to enable community feedback.  SYCL is a royalty-free, cross-platform abstraction layer that enables the development of applications and frameworks that build on the underlying concepts, portability and efficiency of OpenCL™, while adding the ease-of-use and flexibility of C++.  For example, SYCL can provide single source development where C++ template functions can contain both host and device code to construct complex algorithms that use OpenCL acceleration – and then enable re-use of those templates throughout the source code of an application to operate on different types of data.

https://www.khronos.org/news/press/khronos-releases-sycl-1.2-provisional-specification

Higher level languages are very important, as OpenCL is simply too low-level. SYCL is another effort to help researching & improving this area, as we haven’t found the holy grail. Languages like C++AMP and RenderScript claim they can replace OpenCL, but we all know that some implementations of those languages have been done on top of OpenCL.

Khronos Releases WebCL 1.0 Specification

JavaScript bindings to OpenCL brings heterogeneous parallel computing to Web browsers

March 19, 2014 – San Francisco, Game Developer’s Conference – The Khronos™ Group today announced the ratification and public release of the WebCL™ 1.0 specification.  Developed in close cooperation with the Web community, WebCL extends the capabilities of HTML5 browsers by enabling developers to offload computationally intensive processing to available computational resources such as multicore CPUs and GPUs.  WebCL defines JavaScript bindings to OpenCL™ APIs that enable Web applications to compile OpenCL C kernels and manage their parallel execution.  Like WebGL™, WebCL is expected to enable a rich ecosystem of JavaScript middleware that provides access to accelerated functionality to a wide diversity of Web developers.

https://www.khronos.org/news/press/khronos-releases-webcl-1.0-specification

WebCL gets more and more attention, even before it was even official. It would be interesting to see the same growth to higher level language as we have with OpenCL now. for this reason we started the Learning WebCL website, to help you learn WebCL in the future.

Khronos Launches OpenCL 2.0 Adopters Program

Conformance tests now available to certify OpenCL 2.0 implementations

March 19, 2014 – San Francisco, Game Developer’s Conference – The Khronos™ Group today announced the availability of the official conformance test suite for the OpenCL 2.0 specification, making it possible for implementers to certify that their implementations are officially conformant thorough the Khronos OpenCL Adopters Program.  Khronos has also released a set of header files for OpenCL 2.0 and an updated specification with a number of clarifications and corrections to the specification first released in November 2013.

https://www.khronos.org/news/press/khronos-launches-opencl-2.0-adopters-program

Finally the headers are open. Stay tuned for an extensive OpenCL 1.2 vs OpenCL 2.0 comparison, which I have prepared but were unable to finish without the header files.

I hope you are as happy with these announcements as I am. This tells me that OpenCL is ready for real business.

GPU-related PHD positions at Eindhoven University and Twente University

We’re collaborating with a few universities on formal verification of GPU code. The project is called ChEOPS: verified Construction of corrEct and Optimised Parallel Software.

We’d like to put the following PhD position to your attention:


Eindhoven University of Technology is seeking two PhD students to work on the ChEOPS project, a collaborative project between the universities of Twente and Eindhoven, funded by the Open Technology Programme of the NWO Applied and Engineering Sciences (TTW) domain.

In the ChEOPS project, research is conducted to make the development and maintenance of software aimed at graphics processing units (GPUs) more insightful and effective in terms of functional correctness and performance. GPUs have an increasingly big impact on industry and academia, due to their great computational capabilities. However, in practice, one usually needs to have expert knowledge on GPU architectures to optimally gain advantage of those capabilities.

Continue reading “GPU-related PHD positions at Eindhoven University and Twente University”

SC15 news from Monday

SC15Warning: below is raw material, and needs some editing.

Today there was quite some news around OpenCL, I’m afraid I can’t wait till later to have all news covered. Some news is unexpected, some is great. Let’s start with the great news, as the unexpected news needs some discussion.

Khronos released OpenCL 2.1 final specs

As of today you can download the header files and specs from https://www.khronos.org/opencl/. The biggest changes are:

  • C++ kernels (still separate source files, which is to be tackled by SYCL)
  • Subgroups are now a core functionality. This enables finer grain control of hardware threading.
  • New function clCloneKernel enables copying of kernel objects and state for safe implementation of copy constructors in wrapper classes. Hear all Java and .NET folks cheer?
  • Low-latency device timer queries for alignment of profiling data between device and host code.

OpenCL 2.1 will be supported by AMD. Intel was very loud with support when the provisional specs got released, but gave no comments today. Other vendors did not release an official statement.

Khronos released SPIR-V 1.0 final specs

SPIR-V 1.0 can represent the full capabilities of OpenCL 2.1 kernels.

This is very important! OpenCL is not the only language anymore that is seen as input for GPU-compilers. Neither is OpenCL hostcode the only API that can handle the compute shaders, as also Vulkan can do this. Lots of details still have to be seen, as not all SPIRV compilers will have full support for all OpenCL-related commands.

With the specs the following tools have been released:

SPIRV will make many frontends possible, giving co-processor powers to every programming language that exists. I will blog more about SPIRV possibilities the coming year.

Intel claims OpenMP is up to 10x faster than OpenCL

The below image appeared on Twitter, claiming that OpenMP was much faster than OpenCL. Some discussion later, we could conclude they compared apples and oranges. We’re happy to peer-review the results, putting the claims in a full perspective where MKL and operation mode is mentioned. Unfortunately they did not react, as <sarcasm>we will be very happy to admit that for the first time in history a directive language is faster than an explicit language – finally we have magic!</sarcasm>

https://twitter.com/StreamHPC/status/666178711549583364

CT3iJBlVEAA2WCY
Left half is FFT and GEMM based, probably using Intel’s KML. All algorithms seems to be run in a different mode (native mode) when using OpenMP, for which intel did not provide OpenCL driver support for.

We get back later this week on Intel and their upcoming Xeon+FPGA chip, if OpenCL is the best language for that job. It ofcourse is possible that they try to run OpenMP on the FPGA, but then this would be big surprise. Truth is that Intel doesn’t like this powerful open standard intruding the HPC market, where they have a monopoly.

AMD claims OpenCL is hardly used in HPC

Well, this is one of those claims that they did not really think through. OpenCL is used in HPC quite a lot, but mostly on NVidia hardware. Why not just CUDA there? Well, there is demand for OpenCL for several reasons:

  • Avoid vendor lock-in.
  • Making code work on more hardware.
  • General interest in co-processors, not specific one brand.
  • Initial code is being developed on different hardware.

Thing is that NVidia did a superb job in getting their processors in supercomputers and clouds. So OpenCL is mostly run on NVidia hardware and a therefore the biggest reason why that company is so successful in slowing the advancement of the standard by rolling out upgrades 4 years later. Even though I tried to get the story out, NVidia is not eager to tell the beautiful love story between OpenCL and the NVidia co-processor, as the latter has CUDA as its wife.

Also at HPC sites with Intel XeonPhi gets OpenCL love. Same here: Intel prefers to tell about their OpenMP instead of OpenCL.

AMD has few HPC sites and indeed there is where OpenCL is used.

No, we’re not happy that AMD tells such things, only to promote its own new languages.

CUDA goes AMD and open source

AMD now supports CUDA! The details: they have made a tool that can compile CUDA to “HiP” – HiP is a new language without much details at the moment. Yes, I have the same questions as you are asking now.

AMD @ SC15-page-007

Also Google joined in and showed progress on their open source implementation of CUDA. Phoronix is currently the best source for this initative and today they shared a story with a link to slides from Google on the project. the results are great up: “it is to 51% faster on internal end-to-end benchmarks, on par with open-source benchmarks, compile time is 8% faster on average and 2.4x faster for pathological compilations compared to NVIDIA’s official CUDA compiler (NVCC)”.

For compiling CUDA in LLVM you need three parts:

  • a pre-processor that works around the non-standard <<<…>>> notation and splits off the kernels.
  • a source-to-source compiler for the kernels.
  • an bridge between the CUDA API and another API, like OpenCL.

Google has done most of this and now focuses mostly on performance. The OpenCL community can use this to use this project to make a complete CUDA-to-SPIRV compiler and use the rest to improve POCL.

Khronos gets a more open homepage

Starting today you can help keeping the Khronos webpage more up-to-date. Just put a pull request at https://github.com/KhronosGroup/Khronosdotorg and wait until it gets accepted. This should help the pages be more up-to-date, as you can now improve the webpages in more ways.

More news?

AMD released HCC, a C++ language with OpenMP built-in that doesn’t compile to SPIRV.

There have been tutorials and talks on OpenCL, which I should have shared with you earlier.

Tomorrow another post with more news. If I forgot something on Sunday or Monday, I’ll add it here.

Brochures

We have two brochures. One general and one for training. You can also generate PDFs from most pages on this website to support off-line discussion.

New versions will arrive soon – unfortunately there have been quite some delays.

[columns]
[one_half title=”Training”]

In the brochure for trainings (version January 2012) all training modules are written out.

Download our trainings brochure here.

[/one_half]
[one_half title=”General”]

In the general brochure (version May 2011) we explain both our consultancy and training services.

Download our general brochure here.

[/one_half]
[/columns]

StreamHPC.Brochure.2010-01.EN.pdf

An example of real-world, end-user OpenCL usage

We ask all our customers if we could use their story on our webpage. For competition reasons, this is often not possible. The people of CEWE Stiftung & Co. KGaA were so kind to share his experience since he did a OpenCL training with us and we reviewed his code.

Enjoy his story on his experience from the training till now!

This year, the CEWE is planning to implement some program code of the CEWE Photoworld in OpenCL. This software is used for the creation and purchase of photo products such as the CEWE Photobook, CEWE Calendars, greeting cards and other products with an installation base of about 10 million throughout Europe. It is written in Qt and works on Windows, Mac and Linux.

 

In the next version, CEWE plans to improve the speed of image effects such as the oil painting filter, to become more useful in the world of photo manipulation. Customers like to some imaging effects to improve photo products, to get even more individual results, fix accidentally badly focused photos and so on.

Continue reading “An example of real-world, end-user OpenCL usage”