StreamComputing exists 5 years!

In January 2010 I took the first steps of StreamComputing (edit: rebranded to StreamHPC in 2017) by registering the website and writing a hello-world article. About 4 months of preparations and paperwork later, the freelance company was registered. Five years later it has turned into a small company, still with a strong focus on OpenCL, but with more employees and more customers.

I would like to thank the following people:

  • My parents and grand-mother for (financially) supporting me, even though they did not always understand why I was taking all those risks.
  • My friends, for understanding that I needed to work during evenings and weekends.
  • My good friend Laura for supporting me during the hard times of 2011 and 2012.
  • My girlfriend Elena for always being there for me.
  • My colleagues and OpenCL-experts Anca, Teemu and Oscar, who have done the real work the past year.
  • My customers for believing in OpenCL and trusting StreamComputing.

Without them, the company would never even have existed. Thank you! Continue reading “StreamComputing exists 5 years!”

ARM forums to find useful information for OpenCL development

OpenCL on ARM is hot, but it is just getting started. Currently it takes some time to find the information you need about these processors.

For OpenCL discussions the best place is the Khronos OpenCL board. But where can you go when you want to ask questions specifically about GPUs found in ARM-based SoCs, like Mali, PowerVR, Adreno and Vivante?

ARM’s new community site for all

ARM just launched the Connected Community (ARM CC). It is the place to connect to when you need general information on ARM IP, such as the ARM Mali, Cortex-A9 and Cortex-A15.

arm-forums

And here is how ARM themselves explains this initiative on one slide:

ARMConnectedCommunityIntro

Be sure to connect to StreamHPC. We hope this will indeed be the central place for the whole ecosystem, including Imagination, Qualcomm and Vivante.

ARM Mali

Mali-developer

The Mali Developer Center has its forums on the ARM Connected Community.

Imagination PowerVR

The graphics-section of their developer forums seems to be the best place.

imgtec dev forums

(Not @ ARM CC)

Qualcomm Adreno

Qualcomm has dev-forums too and has a section called Mobile Gaming & Graphics Optimization (Adreno™).

qualcomm-forum-adreno

(Not @ ARM CC)

Vivante

Vivante does not have a forum, but Freescale does. The i.MX forums seem to be the best place to ask your questions.

freescale-forums

@ARM CC

Others

Where do you find and share interesting information on mobile GPUs? Share it with others via the comments – the chances that your questions get answered increase when more people visit the forums.

USB-stick sized ARM-computers

Now that smartphones get more powerful and the internet makes it possible to have all your functionality and documents with you anywhere, the computer needs to be reinvented. You see all the big IT companies searching for what that might look like, from Windows Metro to complete docking stations that replace the desktop with your phone. A turbulent market.

One of the new product categories is the USB-stick sized computer. Stick it into a TV or monitor, log in, and you have your personal working environment. You never need to carry a laptop to your hotel room or conference, as long as a screen is available – any screen.

There are several USB-computers entering the market, but I want to introduce you to two. Both see a future for strong processors in portable devices, and neither has a real product with such strong processors yet. But you can expect that in 2013 you will have a device on your key-ring that can do very fast parallel processing, for a smooth Photoshop experience.

Continue reading “USB-stick sized ARM-computers”

StreamComputing is 7 years!

As of 1 April we are 7 years old. Because of all the jokes on that day, this post is a bit later.

Let me take you through our journey from a one-person company to what we are now. With pride I can say that (with ups and downs) StreamComputing (now rebranded to StreamHPC) has become a brand that stands for (extremely) fast software, HPC, GPUs and OpenCL.

7 years of changes

Different services

After 7 years it’s also time for changes. Initially we solely worked on OpenCL related services, mostly GPUs. And this is what we’re currently doing:

  • HPC GPU computing: OpenCL, CUDA, ROCm.
  • Embedded GPU computing: OpenCL, CUDA, RenderScript, Metal.
  • Networked FPGA programming: OpenCL.
  • GPU-drivers testing and optimisation.
  • Software architecture optimisations.

While you see OpenCL a lot, our expertise in vendor-specific CUDA (NVidia), ROCm (AMD), RenderScript (Google) and Metal (Apple) cannot be ignored. Hence the “Performance Engineers” and not “GPU consultants” or “OpenCL programmers”.

From Fixers to Builders and getting new competition

Another change is that we have moved from fixing code afterwards to building software from scratch.

This has been a slow process, and it had to do with growing confidence in performance engineering as an expert profession instead of a trick. We're seeing new companies coming into the market, providing GPU-computing next to their usual services. This is a sign of the market growing up.

We’re confident in growing further in our market, as we have the expertise to design fast software while the newcomers have gained expertise to write code that runs on the GPU with only little speedup.

Community: OpenCL:PRO to OpenCL.org

There have been several moments when we wanted to support the community more. The first try was OpenCL:PRO, which did not live long, as it was actually unclear to us what “the community” wanted.

In the end it was not that hard. Everybody who starts with OpenCL has the same problems:

  • Lack of convenience code, resulting in many, many wrappers and libraries that are incompatible.
  • Lack of practice projects.
  • Lack of overview on what’s available.

With OpenCL.org we aim to solve these problems together with the community. Everything is shared on GitHub, and anybody can join in to complete the information. While our homepage had around 40 pages on these subjects, they only gave our personal view or contained outdated info.

So we’re going to donate most of the OpenCL-related technical pages we’ve written over the years to the community.

There is much more to share – watch our blog, the OpenCL.org Twitter account and the newsletter!

Different Logo

For those who remember: in 2010 the logo looked quite different. We still use the blocks in the background (like on our Twitter account), but since 2014 the colours and font are quite different. This change went along with the company growing up. The old logo is careful, while the new one is bold – we're now more confident about our expertise and value.

Over the past 3 years the new logo has stayed the same and has fully become our identity.

Same kind of customers

It has been quite a journey! We could not have done it without all the customers we served over those 7 years.

Thank you!

What is OpenCL?

OpenCL (a trademark of Apple Inc.) is an open, royalty-free industry standard that makes much faster computations possible. The standard is controlled by the non-profit standards organisation Khronos. By using this technique on graphics cards (GPUs) or on the vector extensions of modern processors, you can for example convert a video in 20 minutes instead of 2 hours.

Programming the GPU was a very difficult task done by specialised teams and universities, but since 2010 it is within reach of more companies.

Below is a video which explains the differences between single-core, multi-core (starting at 1:27) and OpenCL (starting at 2:32).

http://www.youtube.com/watch?v=IEWGTpsFtt8

You can read more about the engineering ins and outs of the standard at http://www.khronos.org/opencl/.

How OpenCL works

OpenCL is an extension to existing languages. It makes it possible to specify a piece of code that is executed multiple times, independently of each other. This code can run on various processors – not only the main one. Also, there is support for vector types (float2, short4, int8, long16, etc.), because modern processors support those.

Say, for example, you need to calculate sin(x) of a large array of one million numbers. OpenCL detects which devices can compute this for you and gives some statistics about each device. You can pick the best device, or even several devices, and send the data to the device(s). Normally you would loop over the million numbers, but now you say something like: “Get me sin(x) of each x in array A”. When finished, you take the data back from the device(s) and you are done.
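To make this concrete, below is a sketch of what such a piece of code – a “kernel” – could look like in OpenCL C. The kernel and array names are just for illustration:

__kernel void compute_sin(__global const float* A, __global float* result)
{
    // Each work-item computes exactly one element, instead of the
    // host looping over all one million numbers sequentially.
    size_t i = get_global_id(0);
    result[i] = sin(A[i]);
}

The host then enqueues this kernel with a global size of one million, and the OpenCL runtime spreads the work-items over the compute units of the chosen device(s).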

As the compute-devices can do more in parallel and OpenCL is better at describing independent functions, the total execution time is much lower than with conventional methods.

5 questions on OpenCL

Q: Why is it so fast?
A: Because many hands make light work – the hundreds of little processors on a graphics card being the extra hands. Cooperation with the main processor remains important, though, to achieve maximum output.

Q: Does it work on any type of hardware?
A: As it is an open standard, it can work on any type of hardware that targets parallel execution. This can be a CPU, GPU, DSP or FPGA.

Q: How does it compare to OpenMP/MPI?
A: Where OpenMP and MPI try to split loops over threads/servers and are CPU-oriented, OpenCL focuses on making threads data-position aware and on using the processor's capabilities. There are several efforts to combine the two worlds.

Q: Does it replace C or C++?
A: No, it is an extension which integrates well with C, C++, Python, Java and more.

Q: How stable/mature is OpenCL?
A: Currently we have reached version 1.2, and the standard itself is 3 years old. OpenCL has many predecessors, though, and is therefore effectively quite a bit older than 3 years.

The most noticeable processors from NVIDIA, AMD and Intel

10 years ago we had CPUs from Intel and AMD, and GPUs from ATI and NVidia. There was even another CPU-maker, VIA, and GPU-makers S3 and Matrox. Things are different now. Below I want to briefly discuss the most noticeable processors from each of the big three.

The reason for this blog-post is that many processors are relatively unknown, and several problems are therefore solved inefficiently. 

NVidia

As NVidia doesn’t have X86, they mostly focuses on GPUs and bet on POWER and ARM for CPU. They already sell their Pascal-architecture in small numbers.

2017 will all be about their Pascal-architecture.

Tesla K80 (Kepler)

  • The GPU is not simply 2 × K40 (GK110B GPUs); the chip is actually different (GK210).
  • It is the NVidia GPU with the largest private memory size usable in kernels: 255 registers per work-item.

This is the GPU for lazy programmers and for genuinely complex code: kernels can use double the number of registers.

Pascal P100 (Pascal)

  • 20 TFLOPS Half Precision (HP), 10 TFLOPS single precision, 5 TFLOPS double precision
  • 16 GB HBM2 (720 GB/s).
  • NVLink up to 64 GB/s effectively (20% of the 80 GB/s is protocol overhead), dual simplex bidirectional (so dedicated wires per direction). Each NVLink offers 16 GB/s up and 16 GB/s down. Compared to 12 GB/s PCIe 3 x16 (24 GB/s cumulative), this is a good speed-up. The support is only available between Pascal GPUs, and not between the GPU and CPU yet.
  • OpenPOWER support coming, to compete with Intel.

Now only available in a $129,000 server with 8 of these (making the price of each P100 about $15,000). It will probably be widely available somewhere in Q1 2017, when HBM2 production is up to speed. It is unknown what the price will be then – that depends on how many companies are willing to pay the high price now.

The GPU is perfect for deep learning, which NVidia is highly focused on. The 5 TFLOPS double precision is very interesting too. A server with 8 GPUs gives you 80 TFLOPS single precision – double that, if you only need half precision.

Titan Black (Kepler) and GTX 980 (Maxwell)

  • The Titan Black has 1.7 TFLOPS DP, 4.5 TFLOPS SP.
  • The GTX 980 has 0.14 TFLOPS DP, 4.6 TFLOPS SP.

The two best-selling GPUs from NVidia that are not server-grade. What is interesting to note is that the GTX 980 is not always faster than the Titan Black, even though it's more recent.

Tegra X1

  • 0.5 TFLOPS SP (GPU), 1 TFLOPS HP
  • 10 Watts

While not well-accepted in the automotive industry (it uses too much power and has no OpenCL), it is well-accepted in the car-entertainment industry.

AMD

Known for the strongest OpenCL support since 2012. With the HSA-capable Fiji GPUs, they have now arrived at their third GPGPU architecture, after “VLIW” and “GCN” – fully driven by their HSA initiative.

For 2017 they focus on their main advantages: brute Single Precision performance, HBM (they have early access), their new CPU (Zen) and new GPU (Polaris).

FirePro S9170 (GCN)

  • 32GB GDDR5 global memory
  • 2.5 TFLOPS DP, 5 TFLOPS SP

The GPU's processor is the same as the FirePro S9150's, which has been the unknown best DP performer of the past years. That GPU got the number 1 spot among air-cooled solutions, only to be surpassed by oil-submersed solutions. The S9170 builds on top of this and adds an extra 16 GB of memory.

The S9170 is the GPU with the largest amount of memory, good for problems that use a lot of memory and are bandwidth-limited – think oil & gas and weather calculations, which currently don't fit on GPUs.

Radeon R9 Nano and FirePro S9300X2 (Fiji)

  • Nano: 0.8 TFLOPS DP, 8 TFLOPS SP, no HP-support at the processor (only for data-transfers)
  • S9300X2: 1.4 TFLOPS DP, 13.9 TFLOPS SP (lower clocked)
  • Nano 175 Watt, S9300X2 300 Watt
  • Nano has 4 GB HBM, with a bandwidth up to 512GB/s, S9300X2 has 2x 4GB HBM.

The Nano is the answer to NVidia’s Titans, and the S9300X2 is its server-class version.

These GPUs bring the best SP-GFLOPS/€ and the best SP-GFLOPS/Watt as of now. The Nano focuses on VR desktops, whereas the S9300X2 enables you to put up to 111 TFLOPS in one server.

AMD Carrizo A10 8890k APU (HSA)

  • CPU with built-in GPU
  • About one TFLOPS
  • TDP of 95 Watt

The fastest HSA-capable processor out there. This means that complex software that needs a mix of task-parallelism and data-parallelism runs best on such a processor. Among CPUs with a built-in GPU, it has the most TFLOPS available on the market.

Intel

After years of “boy who cried wolf” stories, they finally seem to have gotten the Larrabee they promised years ago. With the acquisition of Altera, new processors are on the horizon.

Their focus is still on customers who use test-driven design and want to “make it run quickly, make it perform later”.

Xeon E5-2699 v4

  • 55MB cache, 22 cores
  • AVX 2.0 (256 bit vector operations)
  • DDR4 (60 GB/s)

Not well-known, but this CPU is very capable of running complex HPC code for the price of a high-end GPU. It can reach about 0.64 TFLOPS DP peak when fully using all cores and AVX 2.0 – roughly 22 cores × 1.8 GHz (AVX base clock) × 16 DP FLOPS per cycle.

XeonPhi Knights Landing

  • Available in socket and PCI version
  • 3 TFLOPS DP, 6 TFLOPS SP
  • AVX 512 (512 bit vector operations)
  • 16 GB HBM (over 400 GB/s), up to 384 GB DDR4 (60 GB/s).
  • Currently (?) not programmable with OpenCL

After years of ok-ish XeonPhis, it seems Intel now has a processor that competes with AMD and NVidia. Existing code (almost) just works on this processor and can then be improved step by step. The only thing not to like is the lack of benchmarks – so the numbers above are all on paper.

Xeon+FPGA

  • Task-parallel processor
  • Low-latency

The reconfigurable chip that has been promised for over 2 decades.

I'm still researching this upcoming processor, as one of the strengths of an FPGA is the low-latency links to DisplayPort and networking, which seem to go via PCIe on this processor.

Iris GPUs

  • CPU with built-in GPU
  • 0.7 TFLOPS SP

As these GPUs are included in almost all CPUs that Intel sells, they are the most-sold GPUs.

Selecting the right hardware

Choosing the best hardware has become quite complex, especially when focusing on TCO (Total Cost of Ownership). At StreamHPC we have experience with many of the devices above, but also with various embedded hardware that competes with the processors above on a totally different scale. You need to select the right benchmarks to know what your device of choice is – we can help with that.

GPUDirect and DirectGMA – direct GPU-GPU communication via RDMA

Wrong!
Contrary to what you see around (on slides like these), AMD and Intel also have support for RDMA.

A while ago I found the slide on the right, claiming that AMD did not have any direct GPU-GPU communication. Several sources showed there is, but it seems not to be a well-known feature. The feature is known as SDI (mostly on network cards, SSDs and FPGAs), but not much information is to be found on PCI+SDI. More often the term RDMA is used: Remote Direct Memory Access (Wikipedia).

Questions I try to answer:

  • Which server-grade GPUs support direct GPU-GPU communication when using OpenCL?
  • What other characteristics are interesting for OpenCL devs, besides direct GPU-GPU, GPU-FPGA and GPU-NIC communication?
  • How do you code such fast communication?

Enjoy reading! Continue reading “GPUDirect and DirectGMA – direct GPU-GPU communication via RDMA”

GPU and FPGA challenge for MSc and PhD students

While going through my email, I found out about the third “HiPEAC Student Heterogeneous Programming Challenge”. Unfortunately the deadline was last week, but I just got an email: if you register by this weekend (17 September), you can still join.

EDIT: if you joined, be sure to comment in early November on how it went. That would hopefully motivate others to join next year. Continue reading “GPU and FPGA challenge for MSc and PhD students”

Altera published their OpenCL-on-FPGA optimization guide

Altera-doc

Altera has just released their optimisation guide for OpenCL-on-FPGAs. It does not go into the how-tos of OpenCL, but assumes you have knowledge of the technology. Neither does it provide any information on the basics of Altera's Stratix V or other FPGAs.

It is the first public optimisation document, so feedback sent directly to Altera is appreciated. Not aware of what OpenCL can do on an FPGA? Watch the video below.

https://www.youtube.com/watch?v=p25CVFMc-dk

Subjects

The following subjects and optimisation tricks are discussed:

  • FPGA Overview
  • Pipelines
  • Good Design Practices
  • Avoid Pointer Aliasing
  • Avoid Expensive Functions
  • Avoid Work-Item ID-Dependent Backward Branching
  • Aligned Memory Allocation
  • Ensure 4-Byte Alignment for All Data Structures
  • Maintain Similar Structures for Vector Type Elements
  • Optimization of Data Processing Efficiency
  • Specify a Maximum Work-Group Size or a Required Work-Group Size
  • Loop Unrolling
  • Resource Sharing
  • Kernel Vectorization
  • Multiple Compute Units
  • Combination of Compute Unit Replication and Kernel SIMD Vectorization
  • Resource-Driven Optimization
  • Floating-Point Operations
  • Optimization of Memory Access Efficiency
  • General Guidelines on Optimizing Memory Accesses
  • Optimize Global Memory Accesses
  • Perform Kernel Computations Using Constant, Local or Private Memory
  • Single Work-Item Execution

Carefully compare these with CPU and GPU optimisation guides to be able to write more generic OpenCL code.
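To give an idea of how a few of these subjects translate to kernel code, here is a rough sketch – not taken from the guide. The restrict keyword and the reqd_work_group_size attribute are standard OpenCL; num_simd_work_items, num_compute_units and #pragma unroll are Altera-specific, and the exact syntax may differ per SDK version. The kernel itself is made up for illustration:

__attribute__((reqd_work_group_size(64, 1, 1)))  // specify a required work-group size
__attribute__((num_simd_work_items(4)))          // kernel SIMD vectorization
__attribute__((num_compute_units(2)))            // multiple compute units
__kernel void sum8(__global const float* restrict in,  // restrict avoids pointer aliasing
                   __global float* restrict out)
{
    size_t i = get_global_id(0);
    float acc = 0.0f;
    #pragma unroll                               // loop unrolling
    for (int j = 0; j < 8; j++)
        acc += in[i * 8 + j];
    out[i] = acc;
}

On a GPU you would rather let work-items cooperate via local memory; on an FPGA the compiler turns the unrolled loop and the attributes into deeper, replicated pipelines – which is exactly why the optimisation guides differ.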

Download

You can download the document here.

If you have any question on OpenCL-on-FPGAs, OpenCL, generic optimisations or Altera FPGAs, feel welcomed to contact us.

Visit us (Amsterdam)

So we invited you over? Cool! See you soon!

The Amsterdam Stream HPC offices are located on the sixth floor of Koningin Wilhelminaplein 1 in Amsterdam, which is at the Amsterdam West Poort business area. Below you’ll find information on how to get there.

The office building

When you arrive, ask at the desk to have somebody pick you up. If you want to test out the office security: the unit is 6.01.

Getting to Koningin Wilhelminaplein 1

By Car

The office is located near the ring road A10, which makes the location easily accessible by car, via exit S107.

From the ring road A10 the complete Dutch motorway network is accessible. Taking the A10 to the South often results in a traffic jam though. See https://www.anwb.nl/verkeer for up-to-date traffic info.

Parking in the parking garage is only available if you let us know in advance! There is a ParkBee at 5 minutes walking distance – always more than enough space. It costs at most €10 per day when using the Yellowbrick app or when reserved via ParkBee, and about €20 per day when paid at the location. Please get clarity in advance on who pays this.

Travel times per route (outside rush hours):

  • Office – Schiphol: 15 minutes
  • Office – The Hague: 40 minutes
  • Office – Utrecht: 35 minutes
  • Office – Rotterdam: 50 minutes

By Public transport

The office is a 5 minute walk from Amsterdam Lelylaan. See further below for the walking route.

View in the direction of the office from the metro station

Lelylaan station is a medium-sized public transport hub in Amsterdam. It should be easy to get here from any big city or any address in Amsterdam, as many fast trains also stop here.

  • Trains to the North: Amsterdam Central, Haarlem, North and East of the Netherlands
  • Trains to the South: Schiphol, Amsterdam Zuid, Amsterdam RAI, Utrecht, Eindhoven, Leiden and Rotterdam
  • Bus: Lines 62 (Amstel), 63 (Osdorp), 195 (Schiphol).
  • Metro: Line 50, connecting to the Amsterdam train stations Sloterdijk, Lelylaan, Zuid, RAI and Bullewijk. In case there are problems with the train to Lelylaan/Sloterdijk, one option is to go to Amsterdam Zuid and take the metro from there. Line 51 connects to the Vrije Universiteit in Amsterdam Zuid.
  • Tram: Lines 1 (Osdorp – Muiderpoort) and 17 (Osdorp – Central station).

See https://9292.nl/station-amsterdam-lelylaan for all time tables and planning trips.

Walking from the train/metro station

Remember that in the Netherlands crossing car lanes is relatively safer than crossing bike lanes, contrary to traffic in other countries. In Dutch cities, cars brake when you cross the street, while bikes simply don't. No joke. So be sure not to walk on the red bike lanes unless really necessary.

When leaving the train station, make sure you take the Schipluidenlaan exit towards the south (to the right, when you see the view as in the image). This is where the buses are, not the trams. If you are at the tram area (between two car roads), go back into the station.

From the bus stop, go to the roundabout to the west. Walk the whole street to the next roundabout, where you will see the shiny office building on your right.

By Taxi

In Amsterdam you can order a taxi via +31-20-6777777 (+31-20-6, then six times 7). Expect a minimum charge of €20.

At Schiphol Airport there are official taxi stands – it'll take 15-25 minutes to get to Lelylaan outside rush hours. Make sure to mention the roundabout reconstruction to prevent a 10-minute longer drive.

Bicycle

For biking use https://www.route.nl/routeplanner and use “Rembrandtpark” as the end-point for the better/nicer/faster routes. From the park it’s very quick to get to the office – use a normal maps app to get to the final destination.

Software Development

Have you developed software that gives the answers you need but takes too long? Or maybe you need to process large data-sets on an hourly basis, while each batch takes 2 hours?

What do you do when faster hardware starts to get too costly in terms of maintenance? You can buy specialized hardware, but that increases costs and dependence on external knowledge. Or you can choose to just wait for the results to come in, but you can only do this when the computation is not a core process.

What if you could use off-the-shelf hardware to decrease waiting time? By using OpenCL devices, which can be high-end graphics cards or other modern processors, software can be sped up by a factor of 2 to 20. Why? Because these devices can do much more in parallel, and OpenCL makes it possible to use that (unused) potential. A few years ago this was not possible in the way it is done now; that's probably the main reason you haven't heard of it.

Solutions

Everything we offer comes down to three solutions: find what is available, make a parallel version of the code, and hand-tune the code for maximum performance.

[pricing_tables]
[pricing_table column=”one_third” title=”Specialised Libraries” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • For many cases “good enough”
  • Faster code, the easy way
  • Gives high performance for generic problems

[/pricing_table]
[pricing_table column=”one_third” title=”Parallel Coding” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • Better caching can give more boost than using faster hardware
  • Software running in parallel is a first step to GPU-computing
  • Making the software modular when possible

[/pricing_table]
[pricing_table column=”one_third” title=”High Performance Coding” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • The highest performance is guaranteed
  • Optimized for the targeted hardware

[/pricing_table]
[/pricing_tables]

Services

There are many possibilities to speed up code, but only one is the best fit. To help you find the right path, we offer various services.

[pricing_tables]
[pricing_table column=”one_third” title=”Code Review” buttontext=”More info »” buttonurl=”https://streamhpc.com/consultancy/our-services/code-review/” buttoncolor=””]

  • Code-review of GPU-code (OpenCL, CUDA, Aparapi, and more).
  • Code-review of CPU-code (Java, C, C++ and more).
  • Report within 1 week if necessary.

[/pricing_table]
[pricing_table column=”one_third” title=”GPU Assessment” buttontext=”More info »” buttonurl=”https://streamhpc.com/consultancy/rapid-opencl-assessment/” buttoncolor=””]

  • Find parallelizable computations
  • Assess the fitness of the code to run on GPUs
  • Report within 2 weeks

[/pricing_table]
[pricing_table column=”one_third” title=”Architecture Assessment” buttontext=”Request more info »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • Architecture check-up
  • Data-transport measurements
  • Report within 2 weeks

[/pricing_table]
[/pricing_tables]

More information

We can make your compute-intensive algorithms much faster and scalable. How do we do it? We can explain it all to you by phone or in person. Send in the form on this page, and we will contact you.

You can also call now: +31 6454 00 456.

We invite you to download our brochures to get an overview of how we can help you widen the bottlenecks in your software.

Help write the book “Numerical Computations with GPUs”

There is an interesting book coming up: “Numerical Computations with GPUs” – a book explaining various numerical algorithms, with code in CUDA or OpenCL.

edit: At the moment there are 21 articles to be included in the book.

edit 2: book should be out in July

edit 3: Order via Springer International or Amazon US.
TOC:

  • Accelerating Numerical Dense Linear Algebra Calculations with GPUs.
  • A Guide to Implement Tridiagonal Solvers on GPUs.
  • Batch Matrix Exponentiation.
  • Efficient Batch LU and QR Decomposition on GPU.
  • A Flexible CUDA LU-Based Solver for Small, Batched Linear Systems.
  • Sparse Matrix-Vector Product.
  • Solving Ordinary Differential Equations on GPUs.
  • GPU-based integration of large numbers of independent ODE systems.
  • Finite and spectral element methods on unstructured grids for flow and wave propagation problems.
  • A GPU implementation for solving the Convection Diffusion equation using the Local Modified SOR method.
  • Pseudorandom numbers generation for Monte Carlo simulations on GPUs: Open CL approach.
  • Monte Carlo Automatic Integration with Dynamic Parallelism in CUDA.
  • GPU-Accelerated computation routines for quantum trajectories method.
  • Monte Carlo Simulation of Dynamic Systems on GPUs.
  • Fast Fourier Transform (FFT) on GPUs.
  • A Highly Efficient FFT Using Shared-Memory Multiplexing.
  • Increasing parallelism and reducing thread contentions in mapping localized N-body simulations to GPUs.

 

Continue reading “Help write the book “Numerical Computations with GPUs””

AMD ROCm 1.5 Linux driver-stack is out

ROCm is AMD's open-source Linux driver-stack that brings compute to HSA hardware. It does not provide graphics and is therefore focused on monitor-less applications like machine learning, math, media processing, machine vision, large-scale simulations and more.

For those who do not know HSA: the Heterogeneous System Architecture defines hardware and software such that different processor types (like CPU, GPU, DSP and FPGA) can seamlessly work together and have fine-grained memory sharing. Read more on HSA here.

About ROCm and its short history

The driver stack has been on GitHub for more than a year now. Development is done internally, while communication with users happens mostly via GitHub's issue tracker. ROCm 1.0 was publicly announced on 25 April 2016. Since version 1.0 there have been 6 releases in only one year – the 4 months of waiting between 1.4 and 1.5 were therefore relatively long. You can certainly say development happens at a high pace.

ROCm 1.4 was released at the end of December and, besides a long list of bug fixes, it added a developer preview of OpenCL 2.0 kernel support. Support for OpenCL was limited to Fiji (R9 Fury series) and Baffin/Ellesmere (Radeon RX 400 series) GPUs, as these have the best HSA support of the current GPU offerings.

Currently not all parts of the driver stack are open source, but the binary blobs will be open-sourced eventually. You might wonder why a big corporation like AMD would open-source such an important part of their offering. It makes total sense once you understand that their most important customers spend a lot of time on making the drivers and their own code work together. Giving access to the code makes debugging a lot easier and reduces development time. This results in fewer bugs and a shorter time-to-market for the AMD version of the software.

The OpenCL language runtime and compiler will be open-sourced soon, after which AMD offers full OpenCL without any binary blob.

What does ROCm 1.5 bring?

Version 1.5 adds improved support for OpenCL, where 1.4 only gave a developer preview. Both feature support and performance have been improved. Just like in 1.4 there is support for OpenCL 2.0 kernels and OpenCL 1.2 host code – the tool clinfo mentions there is even some support for 2.1 kernels, but we haven't fully tested this yet.
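A practical note on that split: OpenCL compilers default to the 1.x kernel language, so to actually use 2.0 kernel features you normally request them at build time. A minimal, hedged one-liner (assuming program and device were created as usual):

// "-cl-std=CL2.0" is the standard build option from the OpenCL spec
// to enable the OpenCL C 2.0 kernel language.
err = clBuildProgram(program, 1, &device, "-cl-std=CL2.0", NULL, NULL);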

The command-line administration tool (ROCm-SMI) adds power monitoring, so power efficiency can be measured.
The HCC compiler was upgraded to the latest Clang/LLVM. There have also been big improvements in C++ compatibility.

Other improvements:

  1. Added a new API, hipHccModuleLaunchKernel, which works exactly like hipModuleLaunchKernel but takes the OpenCL programming model's launch parameters – together with a test for it.
  2. Added a new API, hipMemPtrGetInfo.
  3. Added a new field to hipDeviceProp_t, gcnArch, which returns 803, 700, 900, etc.

Bug fixes:

  1. Fixed Copyright and header names
  2. Fixed issue with bit_extract sample
  3. Enabled lgamma and lgammaf
  4. Added guard for GFX8 specific intrinsics
  5. Fixed few issues with operator overloading of vector data types
  6. Fixed atanf
  7. Added a guard for __half data types to work with Clang versions newer than 3 (will be removed eventually).
  8. Fixed 4_shfl to work only on gfx803, as Hawaii doesn't support permute ops.

Current hardware support:

  • GFX7: Radeon R9 290 4 GB, Radeon R9 290X 8 GB, Radeon R9 390 8 GB, Radeon R9 390X 8 GB, FirePro W9100 (16GB), FirePro S9150 (16 GB), and FirePro S9170 (32 GB).
  • GFX8: Radeon RX 480, Radeon RX 470, Radeon RX 460, Radeon R9 Nano, Radeon R9 Fury, Radeon R9 Fury X, Radeon Pro WX7100, Radeon Pro WX5100, Radeon Pro WX4100, and FirePro S9300 x2.

If you’re buying new hardware, pick a GPU from the GFX8 list. FirePro S9300 X2 is currently the server-grade solution of choice.

Keep an eye on the Phoronix website, which is usually the first to benchmark AMD's open-source drivers.

Install ROCm 1.5

Where 1.4 had support for Ubuntu 14.04, Ubuntu 16.04 and Fedora 23, 1.5 added support for Fedora 24 and dropped support for Ubuntu 14.04 and Fedora 23. On other distributions than Ubuntu 16.04 or Fedora 24 it *could* work, but there are zero guarantees.

Follow the instructions on Github step-by-step to get it installed via deb or rpm. Be sure to uninstall any previous release of ROCm to avoid problems.

The part on Grub might not be clear. For this release the magic GRUB_DEFAULT line on Ubuntu 16.04 is:

GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 4.9.0-kfd-compute-rocm-rel-1.5-76"

You need to alter this line with every update, else it will keep booting the old kernel. After changing /etc/default/grub, run “sudo update-grub” to apply it.

Make sure “/opt/rocm/bin/” is in your PATH when you want to do some coding. When running the sample, you should get:

/opt/rocm/hsa/sample$ sudo make
gcc -c -I/opt/rocm/include -o vector_copy.o vector_copy.c -std=c99
gcc -Wl,--unresolved-symbols=ignore-in-shared-libs vector_copy.o -L/opt/rocm/lib -lhsa-runtime64 -o vector_copy
/opt/rocm/hsa/sample$ ./vector_copy
Initializing the hsa runtime succeeded.
Checking finalizer 1.0 extension support succeeded.
Generating function table for finalizer succeeded.
Getting a gpu agent succeeded.
Querying the agent name succeeded.
The agent name is gfx803.
Querying the agent maximum queue size succeeded.
The maximum queue size is 131072.
Creating the queue succeeded.
"Obtaining machine model" succeeded.
"Getting agent profile" succeeded.
Create the program succeeded.
Adding the brig module to the program succeeded.
Query the agents isa succeeded.
Finalizing the program succeeded.
Destroying the program succeeded.
Create the executable succeeded.
Loading the code object succeeded.
Freeze the executable succeeded.
Extract the symbol from the executable succeeded.
Extracting the symbol from the executable succeeded.
Extracting the kernarg segment size from the executable succeeded.
Extracting the group segment size from the executable succeeded.
Extracting the private segment from the executable succeeded.
Creating a HSA signal succeeded.
Finding a fine grained memory region succeeded.
Allocating argument memory for input parameter succeeded.
Allocating argument memory for output parameter succeeded.
Finding a kernarg memory region succeeded.
Allocating kernel argument memory buffer succeeded.
Dispatching the kernel succeeded.
Passed validation.
Freeing kernel argument memory buffer succeeded.
Destroying the signal succeeded.
Destroying the executable succeeded.
Destroying the code object succeeded.
Destroying the queue succeeded.
Freeing in argument memory buffer succeeded.
Freeing out argument memory buffer succeeded.
Shutting down the runtime succeeded.

Also clinfo (installed from the default repo) should work.

Got it installed and tried your code? Did you see improvements? Share your experiences in the comments!

Not really ROCk music, but this blog post was written while listening to the latest album of the Gorillaz.

Handling OpenCL with CMake 3.1 and higher

There has been quite some “find OpenCL” code for CMake floating around. If you haven't heard of CMake: it's the most useful tool to build cross-platform software.

Put this into your CMakeLists.txt, changing the name of the executable.

#Minimal OpenCL CMakeLists.txt by StreamHPC

cmake_minimum_required (VERSION 3.1)

project(GreatProject)

# Handle OpenCL (FindOpenCL ships with CMake 3.1 and newer)
find_package(OpenCL REQUIRED)
include_directories(${OpenCL_INCLUDE_DIRS})

add_executable (main main.cpp)
target_include_directories (main PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})
target_link_libraries (main ${OpenCL_LIBRARIES})

Then do the usual:

  • make a build-directory
  • cd build
  • cmake .. (specifying the right Generator)

Add your own CMake snippets and you're one happy dev!

CMake 3.7

CMake 3.7 makes it even easier! You can do the following:

find_package(OpenCL REQUIRED)
add_executable(test_tgt main.c)
target_link_libraries(test_tgt OpenCL::OpenCL)

This automatically sets up the include paths and target library to link against. No need to use the ${OpenCL_INCLUDE_DIRS} and ${OpenCL_LIBRARIES} any more.

(Thanks Matthäus G. Chajdas for improving this!)

Getting CMake 3.1 or higher

  • Ubuntu/Debian: Get the PPA.
  • Other Linux: Get the latest tar.gz and compile.
  • Windows/OSX: Download the latest exe/dmg from the CMake homepage.

If you have more tips to share, put them in the comments.

An introduction to Grid-processors: Parallella, Kalray and KnuPath

We have been talking about GPUs, FPGAs and CPUs a lot, but there are more processors that can solve specific problems. This time I'd like to give you a quick introduction to grid-processors.

Grid-processors are different from GPUs. Where a multi-core GPU gets its strength from being able to compute lots of data in parallel (SIMD data-parallelism), a grid-processor is able to have each core do something different (MIMD, task-based parallelism). You could say that a grid-processor is a multi-core CPU where the number of cores is at least 16, and the cores are only connected to their neighbours. The difference with full-blown CPUs is that the cores are smaller (like on a GPU) and thus use less power. The companies themselves categorise their processors as DSPs, or Digital Signal Processors, but most popular DSPs only have 1 to 8 cores.

For the context, there are several types of bus-configurations:

  • single bus: like the PCIe-bus in a PC or the i.MX6.
  • ring bus: like the XeonPhi up to Knights Corner, and the Cell processor.
  • star bus: a central communication core with the compute-cores around.
  • full mesh bus: each core is connected to each core.
  • grid bus: all cores are connected to their direct neighbours. Messages hop from core to core.

Each of them has its advantages and disadvantages. Grid-processors get great performance (per Watt) with:

  • video encoding
  • signal processing
  • cryptography
  • neural networks

Continue reading “An introduction to Grid-processors: Parallella, Kalray and KnuPath”

Overview of OpenCL 2.0 hardware support, samples, blogs and drivers

We were too busy lately to tell you about it: OpenCL 2.0 is getting ready for prime time! As it makes use of more recent hardware features, it's more powerful than OpenCL 1.x could ever be.

To get you up to speed, see this list of new OpenCL 2.0 features:

  • Shared Virtual Memory: host and device kernels can directly share complex, pointer-containing data structures such as trees and linked lists, providing significant programming flexibility and eliminating costly data transfers between host and devices (see the sketch after this list).
  • Dynamic Parallelism: device kernels can enqueue kernels to the same device with no host interaction, enabling flexible work scheduling paradigms and avoiding the need to transfer execution control and data between the device and host, often significantly offloading host processor bottlenecks.
  • Generic Address Space: functions can be written without specifying a named address space for arguments, especially useful for those arguments that are declared to be a pointer to a type, eliminating the need for multiple functions to be written for each named address space used in an application.
  • Improved image support: including sRGB images and 3D image writes, the ability for kernels to read from and write to the same image, and the creation of OpenCL images from a mip-mapped or a multi-sampled OpenGL texture for improved OpenGL interop.
  • C11 Atomics: a subset of C11 atomics and synchronization operations to enable assignments in one work-item to be visible to other work-items in a work-group, across work-groups executing on a device or for sharing data between the OpenCL device and host.
  • Pipes: memory objects that store data organized as a FIFO and OpenCL 2.0 provides built-in functions for kernels to read from or write to a pipe, providing straightforward programming of pipe data structures that can be highly optimized by OpenCL implementers.
  • Android Installable Client Driver Extension: Enables OpenCL implementations to be discovered and loaded as a shared object on Android systems.
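To give a taste of the first bullet, below is a minimal, hedged sketch of coarse-grained SVM host code. Error handling is omitted, and it assumes context, queue and kernel were already created on an OpenCL 2.0 platform:

size_t N = 1024;

// Allocate memory that both host and device can address directly.
float* data = (float*)clSVMAlloc(context, CL_MEM_READ_WRITE, N * sizeof(float), 0);

// With coarse-grained SVM the host maps the buffer before touching it...
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, N * sizeof(float), 0, NULL, NULL);
for (size_t i = 0; i < N; i++)
    data[i] = (float)i;
clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

// ...and passes the pointer itself to the kernel: no clCreateBuffer,
// no clEnqueueWriteBuffer, no explicit copies.
clSetKernelArgSVMPointer(kernel, 0, data);

// (enqueue the kernel as usual)
clSVMFree(context, data);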

I could write many articles about the subjects above, but I'll leave that for later. This article won't get into the technical details, but rather into what's available from the vendors. So let's see what toys we were given!

A note: don't start with OpenCL 2.0 directly if you don't know the basic concepts of OpenCL. Continue reading “Overview of OpenCL 2.0 hardware support, samples, blogs and drivers”

Kernels and the GPL. Are we safe and linking?

Disclaimer: I am not a lawyer and below is my humble opinion only. The post is for insights only, not for legal matters.

The GPL has always been a protection against somebody or some company running away with your code and making money with it. Or at least it forces improvements to get back into the community. For unprepared companies this was quite stressful when they were forced to give their software away. Now we have host-kernel languages such as OpenCL, CUDA, DirectCompute and RenderScript, which don't really link a kernel, but load it and launch it. As the GPL is quite complicated when it comes to mixing with commercial code, I want to give a warning that the GPL might not be prepared for this.

If your software is dual-licensed, you cannot assume the GPL is not chosen when eventually used in commercial software. Read below why not.

I hope we can have a discussion here, so we get to the bottom of this.

Continue reading “Kernels and the GPL. Are we safe and linking?”

PDFs of Monday 12 September

As sharing my readings got more popular, I decided to put them on my site. I focus on everything that uses vector-processing (GPUs, heterogeneous computing, CUDA, OpenCL, GPGPU, etc.). Did I miss something, or do you have a story you want to share? Contact me or comment on this article. If you tell others about the projects you discovered here, I would appreciate it if you mention my website or my Twitter handle @StreamHPC.

The research papers have their authors mentioned; the other links can be presentations or overviews of (mostly) products. I have read all of them, except the long PhD theses (which are on my non-ad-hoc reading list) – drop me any question you have.

Bullet Physics, Autodesk style. AMD and Autodesk on integrating Bullet Physics engine into Maya.

MERCUDA: Real-time GPU-based marine scene simulation. OpenCL has enabled more realistic sea and sky simulation for this product, see page 7.

J.P.Morgan: Using Graphic Processing Units (GPUs) in Pricing and Risk. Two pages describing how OpenCL/CUDA can give a 10 to 100 times speedup over conventional methods.

Parallelization of the Generalized Hough Transform on GPU (Juan Gómez-Luna, José María González-Linares, José Ignacio Benavides, Emilio L. Zapata and Nicolás Guil). Describing two parallel methods for the Fast Generalized Hough Transform (Fast GHT) using GPUs, implemented in CUDA. It studies how load balancing and occupancy impact the performance of an application on a GPU. An interesting article, as it shows that you can choose which limits you bump into.

Performance Characterization and Optimization of Atomic Operations on AMD GPUs (Marwa Elteir, Heshan Lin and Wu-chun Feng). Measurement of the impact of using atomic operations on AMD GPUs. It seems that even mentioning ‘atomic' puts the kernel in atomic mode and has a major influence on performance. They also come up with a solution: software-based atomic operations. Work in progress.

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing (Mayank Daga, Ashwin M. Aji, and Wu-chun Feng). Another one from Virginia Tech, this time on AMD's APUs. This article measures the APU's performance via a set of micro-benchmarks (e.g., PCIe data transfer), kernel benchmarks (e.g., reduction) and actual applications (e.g., molecular dynamics). Very interesting to see in which cases discrete GPUs are at a disadvantage even with more muscle power.

A New Approach to rCUDA (José Duato, Antonio J. Peña, Federico Silla, Juan C. Fernández, Rafael Mayo, and Enrique S. Quintana-Ortí). On (remote) execution of CUDA software within VMs. Interesting if you want powerful machines in your company to delegate heavy work to, or if you are interested in clouds.

Parallel Smoothers for Matrix-based Multigrid Methods on Unstructured Meshes Using Multicore CPUs and GPUs (Vincent Heuveline, Dimitar Lukarski, Nico Trost and Jan-Philipp Weiss). Different methods around 8 multi-colored Gauß-Seidel type smoothers using OpenMP and GPUs. Also some words on scalability!

Visualization assisted by parallel processing (B. Lange, H. Rey, X. Vasques, W. Puech and N. Rodriguez). How to use GPGPU for visualising big data. An important factor of real-time data-processing is that people get more insight into the matter. As an example they use temperatures in a server room. As I see more often now, they benchmark CPUs, GPUs and hybrids.

A New Tool for Classification of Satellite Images Available from Google Maps: Efficient Implementation in Graphics Processing Units (Sergio Bernabé and Antonio Plaza). A 30 times speed-up with a new parallel implementation of the k-means unsupervised clustering algorithm in CUDA. It is used for classification of satellite images.

TAU Performance System. Product presentation of TAU, which does, among other things, parallel profiling and tracing. Supports CUDA and OpenCL. An extensive collection of tools, so worth spending time on. A paper released in March describes TAU and compares it with two other performance measurement systems: PAPI and VampirTrace.

An Experimental Approach to Performance Measurement of Heterogeneous Parallel Applications using CUDA (Allen D. Malony, Scott Biersdorff, Wyatt Spear and Shangkar Mayanglambam). Using TAUcuda, a TAU-based tool (see above), this paper describes what to focus on when optimising heterogeneous systems.

Speeding up the MATLAB complex networks package using graphic processors (Zhang Bai-Da, Wu Jun-Jie, Tang Yu-Hua and Li Xin). Free registration required. Their conclusion: “In a word, the combination of GPU hardware and MATLAB software with Jacket Toolbox enables high-performance solutions in normal server”. Another PDF I found was: Parallel High Performance Computing with emphasis on Jacket based computing.

Profile-driven Parallelisation of Sequential Programs (Georgios Tournavitis). PhD-thesis on a new approach for extracting and exploiting multiple forms of coarse-grain parallelism from sequential applications written in C.

OpenCL, Heterogeneous Computing, and the CPU. Presentation by Tim Mattson of Intel on how to use OpenCL with the vector-extensions of Intel-processors.

MMU Simulation in Hardware Simulator Based-on State Transition Models (Zhang Xiuping, Yang Guowu and Zheng Desheng). It seems a bit off the charts to have a paper on the Memory Management Unit of an ARM, but as the ARM processor gets more important, some insight into its memory system is valuable.

Multi-Cluster Performance Impact on the Multiple-Job Co-Allocation Scheduling (Héctor Blanco, Eloi Gabaldón, Fernando Guirado and Josep Lluí Lérida). This research-group has developed a scheduling-technique, and in this paper they discuss in which situations theirs works better than existing techniques.

Convey Computers: Putting Personality Into High Performance Computing. Product presentation. They combine x86 CPUs with pre-programmed FPGAs to get high throughput. In short: if you make heavy use of the provided algorithms, then this might be an alternative to GPGPU.

High-Performance and High-Throughput Computing. What it means for you and your research. Presentation by Philip Chan of Monash University. Though the target group is their own university, it gives nice insight into how things work at other universities and research groups. HPC is getting cheaper and accepted in more and more types of research.

Bull: Porting seismic software to the GPU. Presentation for oil companies on finding new oil fields. These seismic calculations are quite computation-intensive, and therefore portable HPC is needed. Note that StreamHPC is also assisting in porting such code to GPUs.

Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems (Shuai Che, Jeremy W. Sheaffer and Kevin Skadron). This piece of software allows CUDA-programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms.

Real-time volumetric shadows for dynamic rendering (MSc thesis of Alexandru Teodor V.L. Voicu). Self-shadowing using the Opacity Shadow Maps algorithm is not fit for real-time processing. This thesis discusses Bounding Opacity Maps, a novel method to overcome this problem. Including code at the end, which you can download here.

Accelerating Foreign-Key Joins using Asymmetric Memory Channels (Holger Pirk, Stefan Manegold and Martin Kersten). Shows how to accelerate foreign-key joins by executing the random table lookups in the GPU's VRAM while sequentially streaming the foreign-key index through the PCIe bus. Very interesting for how to make clever use of I/O bounds.

Come back next Monday for more interesting research papers and product presentations. If you have questions, don’t hesitate to contact StreamHPC.

Speeding up your data-processing

Using unused processing power

The computer as we know it has changed a lot over the past years. For instance, we can now use the graphics card for non-graphics purposes. This has resulted in a computer with a much higher potential. A doubling of processing speed, or more, is the rule rather than the exception. Using this unused extra speed gives a huge advantage to software that makes use of it – and that explains the growing popularity.

The acceleration technique is called OpenCL. It not only works on graphics cards from AMD and NVidia, but also on the latest processors from Intel and AMD, and even on processors in smartphones and tablets. Special processors such as DSPs and FPGAs will get support too. As it is an open standard, the support will only grow.

Offered services

StreamHPC has been active since June 2010 as an acceleration specialist and offers the following services:

[list2]

  • development of extreme fast software,
  • design of (faster) algorithms,
  • accelerating existing software, and
  • provide training in OpenCL and acceleration-techniques.

[/list2]

Not many companies master this specialism, and StreamHPC enjoys worldwide awareness on top of that. To provide support for large projects, collaborations with other companies have been established.

The preferred way of working is a low hourly rate plus agreed bonuses for speed-ups.

Target markets

The markets we operate in are bio-informatics, finance, audio and video, high-tech R&D, energy, mobile apps and other industries that target more performance per Watt or more performance per second.

WBSO

What we offer suits WBSO projects well (in the Netherlands only). This means that a large part of the costs can be subsidised. Together we can promote new technologies in the Netherlands, which is the goal of this subsidy.

Contact

Call Vincent Hindriksen MSc at +31 6 45400456 or mail to vincent@StreamHPC.nl with all your questions, or request a free demo.

Download the brochure for more information.

Why this new AMD FirePro Cluster is important for OpenCL

FirePro cluster
FirePro S9150 cluster

Then it hit the doormat:

“AMD is proud to collaborate with ASUS, the Frankfurt Institute for Advanced Studies (FIAS) and GSI to support such important physics and computer science research,” said David Cummings, senior director and general manager, professional graphics, AMD. “This installation reaffirms AMD's leading role in HPC with the implementation of the AMD FirePro S9150 server GPUs in this three petaFLOPS supercomputer cluster. AMD and ASUS are enabling OpenCL applications for critical science research usage for this cluster. We're committed to building our HPC leadership position in the industry as a foremost provider of computing applications, tools and technologies.”

You can read more here and the official news here.

Why is this important?

Is it that there are more FLOPS for the same price, as AMD hardware is cheaper? Nice, but secondary.

That it runs OpenCL? We like that, but from a broader perspective this is not the most important reason.

It is important because it creates more diversity in the world of HPC. Currently there are a few XeonPhi clusters and only one big AMD FirePro S10000 cluster. The rest is NVidia Tesla or CPU-only. With more AMD clusters, the HPC market is democratised. That means more software will be written in vendor-neutral languages like OpenCL (with high-level software/libraries on top), and prices of HPC accelerators will not be kept high.

How to further democratise the HPC world?

We started with porting Gromacs to OpenCL, and we will continue to port large projects to OpenCL. This software will simply run on XeonPhi, Tesla and FirePro with just a little porting time, reducing costs in many ways. We cannot do it alone, but together we can. Start by telling us which software needs to be ported from OpenMP to OpenCL or OpenMP 4, or from CUDA to OpenCL. And if you are porting open-source software to OpenCL, drop us a line for free advice and help with testing the software.

And the best thing you can do to break the monopoly of CUDA is to simply buy AMD or Intel hardware. The price difference is enough to buy lots of extra FLOPS and to pay for a complete project porting a large application to OpenCL.

 

How to install OpenCL on Windows

Getting your Windows machine ready for OpenCL is rather straightforward. In short, you only need the latest drivers for your OpenCL device(s) and you're ready to go. Of course, you will need to add an OpenCL SDK in case you want to develop OpenCL applications, but that's equally easy.

Before we start, a few notes:

  • The steps described herein have been tested on Windows 8.1 only, but should also apply for Windows 7 and Windows 8.
  • We will not discuss how to write an actual OpenCL program or kernel, but focus on how to get everything installed and ready for OpenCL on a Windows machine. This is because writing efficient OpenCL kernels is almost entirely OS independent.

If you want to know more about OpenCL and you are looking for simple examples to get started, check the Tutorials section on this webpage.

Running an OpenCL application

If you only need to run an OpenCL application without getting into development stuff then most probably everything already works.

If OpenCL applications fail to launch, then you need to have a closer look at the drivers and hardware installed on your machine:

GPU Caps Viewer

  • Check that you have a device that supports OpenCL. All graphics cards and CPUs from 2011 and later support OpenCL. If your computer is from 2010 or before, check this page. You can also find a list of OpenCL conformant products on the Khronos webpage.
  • Make sure your OpenCL device driver is up to date, especially if you’re not using the latest and greatest hardware. With certain older devices OpenCL support wasn’t initially included in the drivers.

Here is where you can download drivers manually:

  • Intel has hidden them a bit, but you can find them here with support for OpenCL 2.0.
  • AMD’s GPU-drivers include the OpenCL-drivers for CPUs, APUs and GPUs, version 2.0.
  • NVIDIA's GPU-drivers mention mostly CUDA, but the drivers for OpenCL 1.1/1.2 are there too.

In addition, it is always a good idea to check for any other special requirements that the OpenCL application may have. Look for device type and OpenCL version in particular. For example, the application may run only on OpenCL CPUs, or conversely, on OpenCL GPUs. Or it may require a certain OpenCL version that your device does not support.

A great tool that will allow you to retrieve the details of the OpenCL devices in your system is GPU Caps Viewer.
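If you prefer to retrieve those details from code instead, a minimal sketch looks like this (fixed-size arrays and no error handling, to keep it short):

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint numPlatforms = 0;
    clGetPlatformIDs(8, platforms, &numPlatforms);
    if (numPlatforms > 8) numPlatforms = 8;  // only 8 entries were filled

    for (cl_uint p = 0; p < numPlatforms; p++) {
        cl_device_id devices[8];
        cl_uint numDevices = 0;
        // CL_DEVICE_TYPE_ALL lists CPUs, GPUs and accelerators alike.
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &numDevices);
        if (numDevices > 8) numDevices = 8;

        for (cl_uint d = 0; d < numDevices; d++) {
            char name[256], version[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_VERSION, sizeof(version), version, NULL);
            printf("%s - %s\n", name, version);
        }
    }
    return 0;
}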

Developing OpenCL applications

Now it’s time to put the pedal to the metal and start developing some proper OpenCL applications.

The basic steps would be the following:

  • Make sure you have a machine which supports OpenCL, as described above.
  • Get the OpenCL headers and libraries included in the OpenCL SDK from your favourite vendor.
  • Start writing OpenCL code. That’s the difficult part.
  • Tell the compiler where the OpenCL headers are located.
  • Tell the linker where to find the OpenCL .lib files.
  • Build the fabulous application.
  • Run and prepare to be awed in amazement.

Ok, so let’s have a look into each of these.

OpenCL SDKs

For OpenCL headers and libraries the main options you can choose from are:

  • NVIDIA CUDA Toolkit
  • AMD APP SDK
  • Intel SDK for OpenCL Applications

As long as you pay attention to the OpenCL version and the OpenCL features supported by your device, you can use the OpenCL headers and libraries from any of these three vendors.

OpenCL headers

Let’s assume that we are developing a 64bit C/C++ application using Visual Studio 2013. To begin with, we need to check how many OpenCL platforms are available in the system:

[raw]

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_int err;
    cl_uint numPlatforms;

    // Passing 0 and NULL means: only query the number of platforms.
    err = clGetPlatformIDs(0, NULL, &numPlatforms);
    if (CL_SUCCESS == err)
        printf("\nDetected OpenCL platforms: %d", numPlatforms);
    else
        printf("\nError calling clGetPlatformIDs. Error code: %d", err);

    return 0;
}

[/raw]

We need to specify where the OpenCL headers are located by adding the path to the OpenCL “CL” folder to the Additional Include Directories. If you use the NVIDIA CUDA Toolkit, the “CL” folder is in the same location as the other CUDA include files, that is, CUDA_INC_PATH. On a x64 Windows 8.1 machine with CUDA 6.5, the environment variable CUDA_INC_PATH is defined as “C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include”.

If you’re using the AMD SDK, you need to replace “$(CUDA_INC_PATH)” with “$(AMDAPPSDKROOT)/include” or, for Intel SDK, with “$(INTELOCLSDKROOT)/include“.

OpenCLNVIDIA_AdditionalInclude

OpenCL libraries

Similarly, we need to let the linker know about the OpenCL libraries. Firstly, add OpenCL.lib to the list of Additional Dependencies:

OpenCLNVIDIA_AdditionalDependencies

Secondly, specify the OpenCL.lib location in Additional Library Directories:

OpenCLNVIDIA_AdditionalLibrary

As in the case of the includes, if you're using the AMD SDK, replace “$(CUDA_LIB_PATH)” with “$(AMDAPPSDKROOT)/lib/x86_64”, or in the case of Intel with “$(INTELOCLSDKROOT)/lib/x64”.

And you're good to go! The application should now build and run. Now, how difficult was that? Happy OpenCL coding on Windows!

If you have any question or suggestion, just leave a comment.