Porting code that uses random numbers


When we port software to the GPU or FPGA, testability is very important. Part of making the code testable is getting its functionality fully under control, and as you may have guessed, run-time generated random numbers deserve special attention.

In a selection of past projects, random numbers were generated anew on every run. Statistically the simulations were more correct that way, but it made it impossible to be 100% sure the ported code was functionally correct, because two sources of variation are introduced at once: one due to the numbers being different and one due to differences in code and hardware.

Even if the combined variations stay within the given error limits, the two code bases can have unnoticed, different functionality. On top of that, it is hard to keep further optimisations under control, as those can lower the precision.

When porting, the stochastic correctness of the simulations is less important. Predictable outcomes should be leading during the port.

Below are some tips we gave to these customers, and I hope they’re useful for you. If you have code to be ported, these preparations make the process quicker and more correct.
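The core preparation can be sketched in a few lines: fix the seed (or record the generated stream to a file), so the original and the ported code consume exactly the same numbers and the outputs can be compared bit by bit. A minimal Python sketch of the idea – `simulate()` is a made-up stand-in for the real computation:

```python
import random

def simulate(seed, steps=1000):
    """Stand-in for a Monte-Carlo-style computation.
    With a fixed seed the outcome is fully reproducible."""
    rng = random.Random(seed)      # private RNG, not the shared global one
    total = 0.0
    for _ in range(steps):
        total += rng.uniform(-1.0, 1.0)
    return total

# The reference run and the "ported" run consume the identical stream,
# so any difference in output must come from the code, not the numbers.
assert simulate(seed=42) == simulate(seed=42)
```

Once both versions match on the fixed stream, the seed can be made run-time again for production runs, restoring the statistical properties.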

If you want to know more about the correctness of RNGs themselves, we discussed earlier this year that generating good random numbers on GPUs is not obvious.

Continue reading “Porting code that uses random numbers”

video: OpenCL on Android

Michael Leahy spoke at AnDevCon ’13 about OpenCL on Android. Enjoy the overview!

Subjects (globally):

  • What is OpenCL
  • 13 dwarfs
  • RenderScript
  • Demo

http://www.youtube.com/watch?v=XQCYWmYCJWo

Mr. Leahy is quite critical of Google’s recent decisions to try to block OpenCL in favour of their own proprietary RenderScript Compute (now mostly referred to as just “RenderScript”, as they failed to push the twin “RenderScript Graphics”, now replaced with OpenGL).

Around March ’13 I submitted a proposal to speak about OpenCL on Android at AnDevCon in November, shortly after the “hidden” OpenCL driver was found on the N4/N10. This was the first time I covered this material, so I didn’t have a complete idea of how long it would take, but the AnDevCon limit was ~70 minutes. The talk was supposed to be 50 minutes, but I spoke for 80. Since this was the last presentation of the conference and those in attendance were interested enough in the material, I was lucky to captivate the audience that long!

I was a little concerned about taking a critical opinion toward Google given how many folks think they can create nothing but gold. Afterward I recall some folks from the audience mentioning I bashed Google a bit, but this really is justified in the case of suppression of OpenCL, a widely supported open standard, on Android. In particular last week I eventually got into a little discussion on G+ with Stephen Hines of the Renderscript team who is behind most of the FUD being publicly spread by Google regarding OpenCL. One can see that this misinformation continues to be spread toward the end of this recent G+ post where he commented and then failed to follow up after I posted my perspective: https://plus.google.com/+MichaelLeahy/posts/2p9msM8qzJm

And that’s how I got in contact with Michael: we are both irritated by Google’s actions against our favourite open standards. Microsoft learned long ago that you should not block, only favour. But Google lacks that experience and believes they’re above the rules of survival.

Apparently he can dish out FUD, but can’t be bothered to answer challenges to the misinformation presented. Mr. Hines is also the one behind shutting down commentary on the Android issue tracker, limiting the larger developer community’s ability to express their interest in OpenCL on Android.

Regarding a correction: at the time of the presentation, I mentioned that Renderscript uses OpenCL for its GPU compute aspects. This was true for the Nexus 4 and 10 on Android 4.2 and likely 4.3 – in particular the Nexus 10 with its Arm Mali GPU. Since then, Google has been getting the various GPU manufacturers to create Renderscript drivers that don’t utilize OpenCL for GPU compute.

I hope you like the video and also understand why it remains important that we keep the discussion on Google + OpenCL active. We must remain focused on the long term and not simply accept what others decide for us.

Intel CPUs, GPUs & Xeon Phi

[infobox type="information"]

Need a XeonPhi or Intel OpenCL programmer? Hire us!

[/infobox]

Intel has OpenCL support for all their recent CPUs that have SSE 4.x and AVX. Since Sandy Bridge the CPUs tend to have good performance. On Ivy Bridge and later there is also support for the embedded GPU (Windows-only). The Xeon Phi has OpenCL support too, even though Intel promotes OpenMP most.

SDK

Intel does not provide a standard SDK containing both hardware and software, as their hardware is broadly available.

The driver can be downloaded from the Intel OpenCL page – select your OS at the upper-right and click ‘Download’.

The samples are included with the driver, if you use Windows. They can be downloaded separately here. If you have Linux, you can download the samples which have been ported to GCC from our blog – here you can also read on how to install the SDK.

Tools

There are various developer tools available. You can find them here:

  • Offline compiler (stand-alone (Windows+Linux) and VisualStudio-plugin)
  • OpenCL – Debugger (VisualStudio only)
  • Integration with Graphics Performance Analyzers (Windows download)
  • VTune Amplifier XE for code-optimisation (more info here, starting at $899 for both Windows and Linux)

Supported hardware

In short: all Ivy Bridge and Sandy Bridge processors.


Currently the HD4000 is the only embedded GPU that can do OpenCL, and it is only supported via Windows drivers.

Xeon Phi

Intel’s official page has more info on the processor-card, and here you’ll find the most recent (public) info.

Xeon Phi
Non-production version of Xeon Phi with half the memory-banks visible around the (large) die.

CPUs and GPUs

With Xeons of 12 to 16 cores and AVX2 (256-bit wide vectors), OpenCL works very well on CPUs.

For GPU bug-reports go to this forum.

Learning material

See this blog post for information on where to find all drivers and samples.

To optimise OpenCL for Intel-processors, you can go through their very nice Optimization Guide. There is also a nice overview of tips&tricks in this article. The Intel OpenCL forums are also a very good source of information.

AMD OpenCL Programming Guide August 2013 is out!

AMD has just released an update to their OpenCL programming guide.

Download the guide (PDF) August version

Download the guide (PDF) November version

Download TOC (PDF)

For more optimisation guides, see the tutorials page of the knowledge base.

Table of Contents

Chapter 1 OpenCL Architecture and AMD Accelerated Parallel Processing

1.1 Software Overview
1.1.1 Synchronization

1.2 Hardware Overview for Southern Islands Devices

1.3 Hardware Overview for Evergreen and Northern Islands Devices

1.4 The AMD Accelerated Parallel Processing Implementation of OpenCL

Continue reading “AMD OpenCL Programming Guide August 2013 is out!”

OpenCL potentials: Watermarked media for content-protection

HTML5 is the future; Flash and Silverlight are abandoning the market to make way for HTML5 video. But there is one big problem: it is hard to protect the content – before you know it, the movie is on the free market. DRM is only a temporary solution and often ends in frustration for users who just want to watch the movie wherever they want.

If you look at e-books, you see a much better way of keeping PDFs from spreading all over the web: personalisation. This could be done with images and videos too. The example here at the right has a very obvious, clearly visible watermark (source), but there are many methods that are not easy to see – and thus easier to miss for people who want to clean the file. Watermarking therefore has a clear advantage over DRM, where it is obvious what has to be removed, and it gives buyers freedom of use. The only disadvantage is that ownership of a personalised video cannot be transferred.

Continue reading “OpenCL potentials: Watermarked media for content-protection”

The single-core, multi-core and many-core CPU

Multi-core CPU from 2011

CPUs are now split up into three types, depending on the number of cores: single (1), multi (2–8) and many (10+).

I find it important to make this three-way split now, as the types of problems to be solved by each are very different. Based on those problem differences, I even expect the gap in core counts between multi-core CPUs and many-core CPUs to grow.

Below are the three types of CPUs discussed and a small discussion on many-core processors we see around. Continue reading “The single-core, multi-core and many-core CPU”

Performance of 5 accelerators in 4 images

If there were one rule for getting the best performance, it would be: avoid data transfers. Therefore it’s important to have lots of bandwidth and GFLOPS per processor, and not simply to add up those numbers across processors. Everybody who has worked with MPI knows why: transferring data between processors can totally kill performance. So the more that is packed into one chip, the better the results.

In this short article, I would like to give you a quick overview of the current state of bandwidth and performance. You would think the current generation of accelerators is very close, but actually it is not.

The devices in the images below are the AMD FirePro S9150 (16GB), NVidia Tesla K80 (1 GPU of the 2, 12GB), NVidia Tesla K40 (12GB), Intel XeonPhi 7120P (16GB) and Intel Xeon 2699 v3 (18-core CPU). I hesitated between the K40 and the K80, as I wanted to focus on single GPUs only – so I took both. Dual-GPU cards have an advantage when it comes to power consumption and physical space; neither is taken into consideration in this blog. Efficiency (actual performance compared to the theoretical maximum) is not included either, as that would also need a broad explanation.

Each of these accelerators can be programmed with both OpenMP (on x86) and OpenCL.

The numbers

The bandwidth and performance show where things stand: The XeonPhi and FirePro have the most bandwidth, and the FirePro is a staggering 70% to 100% faster than the rest on double precision GFLOPS.

[chart: bandwidth per chip]
Xeon Phi gets to 350 GB/s, followed by the FirePro with 320 GB/s and the K40 with 288 GB/s. NVidia’s K80 only reaches 240 GB/s per GPU, while CPU DDR memory gets only 50–60 GB/s.

 

[chart: GFLOPS per chip]
The FirePro leaves the competition far behind with 2530 GFLOPS (double precision). The K40 and K80 get 1430 and 1450 (per GPU), followed by the CPU at 1324 and the Xeon Phi at 1208. Notice these are theoretical maximums and will be lower in real-world applications.

 

If you have OpenCL or OpenMP code, you can optimise it for a new device in a short time. And yes, it should have been written in OpenCL or OpenMP, as otherwise the competition can easily outperform you simply by selecting a better device.

Costs

Lowest prices in the Netherlands, at the moment of writing:

  • Intel Xeon 2699 v3: € 6,560
  • Intel Xeon Phi 7120P + 16GB DDR4: € 3,350
  • NVidia Tesla K80: € 5,500 (€ 2,750 per GPU)
  • NVidia Tesla K40: € 4,070
  • AMD FirePro S9150: € 3,500

For some devices (like the K40) one shop has a low price, while the others are at least €200 more expensive.

Note: as the Xeon can have 1TB of memory, the “costs per GB/s” is only half the story. Currently the accelerators have at most 16GB. Soon a 32GB FirePro, the S9170, will be available in the shops to move up in this space of memory-hungry HPC applications.

 

[chart: costs per GB/s]
While the four accelerators sit around €11 per GB/s, the Xeon takes €131 (see the note above). Note that the K40, at €14.13, is expensive compared to the other accelerators.

 

[chart: costs per GFLOPS per chip]
For raw GFLOPS the FirePro is the cheapest, followed by the K80, the XeonPhi and then the K40. The XeonPhi and K40 are about twice as expensive per GFLOPS as the FirePro, and the Xeon is clearly the most expensive at 3.5 times the FirePro’s cost per GFLOPS.

If costs are an issue, then it really makes sense to invest some time in making your own Excel sheets for several devices and include costs for power usage.
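As a starting point for such a sheet, the article’s own numbers fit in a few lines of Python (prices and double-precision GFLOPS as listed above; power usage still left out here):

```python
# Prices (EUR) and double-precision GFLOPS as listed in the article.
devices = {
    "FirePro S9150":    (3500, 2530),
    "Tesla K80 (1 GPU)": (2750, 1450),
    "Tesla K40":        (4070, 1430),
    "Xeon Phi 7120P":   (3350, 1208),
    "Xeon 2699 v3":     (6560, 1324),
}

# Euro per GFLOPS, the metric behind the last chart.
cost_per_gflops = {name: price / gflops
                   for name, (price, gflops) in devices.items()}

# Print cheapest-first, as in the article's ranking.
for name, cost in sorted(cost_per_gflops.items(), key=lambda kv: kv[1]):
    print(f"{name:18s} EUR {cost:.2f} per GFLOPS")
```

Running it reproduces the ranking above: FirePro first, then K80, Xeon Phi, K40, with the Xeon last at roughly 3.5 times the FirePro’s cost per GFLOPS.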

Which to choose?

Based on the above numbers, the FirePro is the best choice. But your algorithm might simply work better on one of the others – we can help you by optimising your code and performing meaningful benchmarks.

Company History

There are not many companies like Stream HPC in Europe. Most others are either government institutes running the national supercomputer, freelancers, or not actually experienced with GPUs. So how did it start?

2009: The bore-out

Stream’s founder Vincent Hindriksen had to maintain a piece of software that was often failing to process the daily reports. After documenting the internals and algorithms of the code by interviewing the key people and some reverse engineering, it was a lot easier to create effective solutions for the bugs within the software. After fixing a handful of bugs, there was simply a lot less to do except reading books and playing online games.

To avoid becoming a master in Sudoku, he spent the following three weeks rewriting all the code, using the freshly produced documentation. The 2.5 hours needed to process the data were reduced to 19 seconds – yes, the kick for performance optimisation was already there. For some reason it took well over 6 months to put the proof-of-concept into production, which was simply unbearable, as somebody had to make sure the old code kept running for 40 hours a week. As he was the only one who understood the code, there was no option to be placed on another project.

This ended in a bore-out: not wanting to go to work anymore. It’s actually much the same as a burn-out, but with a different cause.

2010: a new start

From that bore-out the company was born the next year. There were two options: GPGPU (mostly OpenCL, a hobby) or building smart products for public transport. Two domains were bought, and the choice was made during the year. For public transport a proof-of-concept was made, but the choice fell on the really difficult work.

Not much money was earned that year. Even government-support had to be paid back as one invoice was sent 2 weeks too early.

2011-2013: What’s a GPU?

We now have a clear idea of what GPUs can do, but in 2010–2014 GPUs were still seen as graphics-only and sales were very difficult. Selling to somebody who responds with “GGGGGGraphics Processing Unit” is quite difficult.

A loan of €4000 from Vincent’s grandmother, a landlord who was relaxed about payments, the trust in the technology by early customers, and late payments to our creditors got us through.

2014: Employee #1

There was still no stable income. Sales & marketing also took a lot of time, hurting the time that could be spent on actual work. But slowly we got more traction – more people started to believe in the company’s vision.

By the end of the year the first employee was hired: Anca.

As the choice was to build a services company instead of a products company, banks and investors were not even interested in providing financial support. We can now say that in the long term this was for the best – we fully control our own strategies and can invest in our own product development.

2015-2020: First growth phase

Growth is hard, really hard. And we learned that, well, the hard way. Several decisions would now be made differently, but we adapted and continued. Some examples:

  • Investing in FPGAs too early. OpenCL-on-FPGAs was the next big thing, so based on what vendors promised us, we made the same promises to our customers. Many of those promises did not turn into reality.
  • Hiring the wrong people. Or: hiring people for whom we are the wrong company, as it goes both ways. We now define our culture, because we want people who fit our culture.
  • All the other things that are in the books under “early stage growth”.

By 2021 we had got past the growing pains and went into the second phase.

2021: The second office

The first choice was actually Belgium, because it was closer to Amsterdam. Unfortunately that project did not succeed. By coincidence we ended up in Budapest, and grew out of the office space within the first year.

New training dates for OpenCL on CPUs and GPUs!

OpenCL remains a popular programming language for accelerators, from embedded to HPC. Good examples are consumer software and embedded devices. With Vulkan potentially getting OpenCL support in the future, the number of supported devices will only increase.

For multicore-CPUs and GPUs we now have monthly training dates for the rest of the year:

The minimum number of participants is two. On request the location and date can be changed.

The first day of the training is the OpenCL Foundations training, which can be booked separately.

For more information call us at +31854865760.

When Big Data needs OpenCL

Big Data in the previous century was the archive full of ring binders and folders, which would grow each year at the same pace. Now the definition is that it should grow each year as much as in all previous years combined.

A few months ago SunGard named 10 Big Data trends transforming financial services. I have used their list as a base for my own focus: increased computation demands, not specific to that one market. This resulted in 7 general trends where Big Data meets/needs OpenCL.

Since the start of StreamHPC we have sought customers who could not compute through their whole data in time. Back then Big Data was still a buzzword catching on, but it best describes one of our core businesses.

Continue reading “When Big Data needs OpenCL”

“Soon we will use only one thousandth of available computer capacity”

Professor Henri Bal
Professor Henri Bal, who tries to wake up the Netherlands to start going big on parallel programming

At StreamHPC we mostly work for companies in the bigger countries of Europe and North America, and hardly for companies in the Netherlands. But it seems that after 5 years of sleeping, something is stirring. Below is a (translated) article with the above quote by Prof. Dr. Ir. Henri Bal, professor of Computer Science at the Vrije Universiteit Amsterdam.

Lack of knowledge of parallel programming will cause a situation where only one thousandth of the capacity of computers will be used. This makes computations unnecessarily slow and inaccurate. That in turn will slow down the development of the Dutch knowledge economy.

Sequential programming, instructing computers to perform calculations in a queue, is now the standard. Computer processors, however, are much more sophisticated and able to perform thousands or even millions of computations simultaneously. But the programming of such many-cores “is still in its infancy, so industries that rely heavily on data cannot perform optimally”, claims Bal.

The value of parallel programming, according to Bal, is enormous for, for example, meteorology and forensics. “For weather forecasting, the data from the dense measuring network needs to be processed quickly and accurately, to have a weather forecast for tomorrow and not after 48 hours,” he says. “In forensics, all data should be explored as soon as possible in the first 24 hours after a crime, using pattern recognition, so that no trace is lost. The video material of 80,000 security cameras was searched through manually after the attack on the London Underground in 2005 – with parallel computing methods a computer can now do this rapidly.”

If the Netherlands wants to close the gap, investments are necessary, says Bal. The focus should be on research and teaching. “Investments in research on programming the new massively parallel machines are required to gain knowledge. It must be examined how programs should be written for parallel computing methods, and to what extent parallel calculations can be performed automatically. In teaching, our future programmers also need to be prepared for the new standards of parallel programming. Only then can the Netherlands make optimal use of the available computer capacity.”

I think my fellow countrymen will be surprised that they can find help just around the corner. And if they wait two more years, then a 1000x speed-up over sequential programs will indeed become possible.

Have you seen similar articles that sequential programming is slowing the knowledge economy?

Reducing downtime with OpenCL… Ever thought of that?

Something that creates extra value for OpenCL is the flexibility with which it runs on a wide variety of hardware. A famous strategy is running the code on CPUs to find data races and debug the code more easily. Another is to develop on GPUs and port to FPGAs to reduce the development cycles.

But there’s one, quite important, often forgotten: replacement of faulty hardware. You can blame the supplier, or even Murphy if you want, but what is almost certain is that there’s a high chance of facing downtime precisely when the hardware cannot be replaced right-away.

Failing to plan is planning to fail

To limit downtime, there are a few options:

  • Have a good SLA in place for 24/7 hardware-replacement.
  • Have spare-hardware in stock.
  • Have over-capacity on your compute-servers.

But the problem is that all three are expensive in some form if you’re not flexible enough. If you use professional accelerators like the Intel XeonPhi, NVidia Tesla or AMD FirePro, you risk unexpected stock shortages at your supplier.

With OpenCL the hardware can be replaced by any accelerator, whereas with vendor-specific solutions this is not possible.

Flexibility by OpenCL

I’d like to share one example of how to introduce flexibility into your hardware management, but there are various others more tailored to your requirements.

To detect faulty hardware, think of a server with three GPUs and let selected jobs be run by all three – any hardware problem will be detected and pinpointed. Administrating which hardware has run which job completes the mechanism. Exactly the same approach can be used when replacing faulty hardware with any accelerator: let the replacement accelerator run the same jobs as the other two, as an acceptance test.
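The bookkeeping behind this can be sketched in a few lines of Python. The sketch assumes each device returns a comparable result for the same job and that a healthy majority exists; in practice you would compare floating-point results within a tolerance rather than exactly. The function and device names are made up for the example:

```python
from collections import Counter

def pinpoint_faulty(results):
    """results: dict mapping device name -> result of the same job.
    Returns the devices that disagree with the majority result,
    assuming more than half of the devices are healthy."""
    majority_value, votes = Counter(results.values()).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority - cannot pinpoint a faulty device")
    return [dev for dev, value in results.items() if value != majority_value]

# Three devices ran the same job; gpu1 produced a deviating result.
jobs = {"gpu0": 3.14159, "gpu1": 2.71828, "gpu2": 3.14159}
print(pinpoint_faulty(jobs))
```

The same routine doubles as the acceptance test: add the replacement accelerator as a fourth entry and check that it lands in the majority.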

If you need your software to be optimised for several accelerators, you’re in the right place. We can help you with both machine and hand optimizations. That’s a plan that cannot fail!

CPU Code modernisation – our hidden expertise

You’ve seen the speedups possible on GPUs. We secretly know that many of these techniques would also work on modern multi-core CPUs. If after the first optimisations the GPU still gets an 8x speedup, the GPU is the obvious choice. When it’s 2x, would the better choice be a bigger CPU or a bigger GPU? Currently the GPU is chosen more often.

Now that AMD and Intel have 28+ core CPUs, the answer to that question might lean towards the CPU. With a CPU that has 32 cores and 256-bit vector computations via AVX2, 32 double4 vectors can be computed each clock cycle. A 16-core AVX1 CPU could work on 16 double2’s, which is only a fourth of that performance. Actual performance compared to peak performance is comparable to GPUs here. Continue reading “CPU Code modernisation – our hidden expertise”
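The back-of-the-envelope arithmetic behind that comparison is simply vector lanes times cores. A sketch following the article’s numbers (ignoring FMA and issue-width details):

```python
def double_lanes(cores, vector_bits):
    """Doubles processed per clock cycle across all cores,
    ignoring FMA and issue-width details. A double is 64 bits."""
    return cores * (vector_bits // 64)

avx2_cpu = double_lanes(cores=32, vector_bits=256)  # 32 x double4 = 128 doubles
older_cpu = double_lanes(cores=16, vector_bits=128)  # 16 x double2 = 32 doubles

# "only a fourth of that performance"
assert avx2_cpu == 4 * older_cpu
```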

Four conferences that will interest you

OpenCL Events

(if you get to Palo Alto, Manchester, Karlsruhe and Copenhagen)

We’re supporters of open standards and open discussions. When those two come together, we melt. Therefore I’d like to share four hot conferences with you: IWOCL (Palo Alto, CA, USA), EMiT (Manchester, UK), ParallelCon (Karlsruhe, Germany) and the GPGPU-day 2015 (Copenhagen, Denmark).

I’ll be at all of these conferences and am happy to meet you there.

This post was shared first via the newsletter. Subscribe here.

Continue reading “Four conferences that will interest you”

Visit us (Amsterdam)

So we invited you over? Cool! See you soon!

The Amsterdam Stream HPC offices are located on the sixth floor of Koningin Wilhelminaplein 1 in Amsterdam, which is at the Amsterdam West Poort business area. Below you’ll find information on how to get there.

The office building

When you arrive, ask at the desk for someone to come and pick you up. If you want to test out the office security: the unit is 6.01.

Getting to Koningin Wilhelminaplein 1

By Car

The office is located near the ring road A10, which makes the location easily accessible by car, via exit S107.

From the ring road A10 the complete Dutch motorway network is accessible. Taking the A10 to the South often results in a traffic jam though. See https://www.anwb.nl/verkeer for up-to-date traffic info.

Parking in the parking garage is only available when you let us know in advance! There is a ParkBee at 5 minutes walking distance – always more than enough space. It costs at most €10 per day when using the Yellowbrick app or when reserved via ParkBee, and about €20 per day when paid at the location. Please agree in advance on who pays this.

Travel times (outside rush hours):

  • Office – Schiphol: 15 minutes
  • Office – The Hague: 40 minutes
  • Office – Utrecht: 35 minutes
  • Office – Rotterdam: 50 minutes

By Public transport

The office is a 5 minute walk from Amsterdam Lelylaan. See further below for the walking route.

View in the direction of the office from the metro station

In Amsterdam the Lelylaan station is a medium sized public transport hub. It should be easy to get from any big city or any address in Amsterdam to here, as many fast trains also stop here.

  • Trains to the North: Amsterdam Central, Haarlem, North and East of the Netherlands
  • Trains to the South: Schiphol, Amsterdam Zuid, Amsterdam RAI, Utrecht, Eindhoven, Leiden and Rotterdam
  • Bus: Lines 62 (Amstel), 63 (Osdorp), 195 (Schiphol).
  • Metro: Line 50 connecting to Amsterdam train-stations Sloterdijk, Zuid, RAI and Bullewijk. In case there are problems with the train to Lelylaan/Sloterdijk, one option is to go to Amsterdam Zuid and take the metro from there. Line 51 connects to Vrije University in Amsterdam Zuid.
  • Tram: Lines 1 (Osdorp – Muiderpoort) and 17 (Osdorp – Central station).

See https://9292.nl/station-amsterdam-lelylaan for all time tables and planning trips.

Walking from the train/metro station

Remember that in the Netherlands crossing car lanes is relatively safer than crossing bike lanes, contrary to traffic in other countries. In Dutch cities cars brake when you cross the street, while bikes simply don’t. No joke. So be sure not to walk on the red bike lanes unless really necessary.

When leaving the Train station, make sure you get to the Schipluidenlaan-exit towards the South (to the right, when you see the view as on the image). This is where the buses are, not the trams. If you are at the trams area (between two car roads), go back to the station area.

When near the bus stop, head to the roundabout to the west. Walk the whole street to the next roundabout, where you’ll see the shiny office building on your right.

By Taxi

In Amsterdam you can order a taxi via +31-20-6777777 (+31-20, then a 6 and six times 7). Expect a minimum charge of €20.

At Schiphol Airport there are official taxi stands – it’ll take 15–25 minutes to get to Lelylaan outside rush hours. Make sure to mention the roundabout reconstruction to avoid a 10-minute longer drive.

Bicycle

For biking use https://www.route.nl/routeplanner and use “Rembrandtpark” as the end-point for the better/nicer/faster routes. From the park it’s very quick to get to the office – use a normal maps app to get to the final destination.

What more does Khronos have to offer than OpenCL and OpenGL?

The OpenCL standard is from the not-for-profit industry consortium Khronos Group. But they do a lot more, like the famous graphics standard OpenGL. The focus of the group has always been on multimedia and on getting the fastest results out of the hardware.

Now that open source and open standards are getting more important, collaborations like the Khronos Group get more attention. At StreamHPC we are very happy with this trend, as the business models are more focused on collaboration and getting things done than on making sure the customer can never leave.

Below is an overview of the most important APIs that Khronos has to offer.

OpenCL related

  • OpenCL: compute
  • WebCL: web compute
  • SPIR/SPIR-V: intermediate language for compute kernels, like those of OpenCL and OpenGL’s GLSL
  • SYCL: high-level language for OpenCL

OpenGL related

  • Vulkan: state-less graphics
  • OpenGL: graphics
  • OpenGL ES: embedded graphics
  • WebGL: web graphics
  • glTF: runtime asset format for WebGL, OpenGL ES, and OpenGL
  • OpenGL SC: Graphics for Safety Critical operations
  • EGL: interface between rendering APIs such as OpenGL ES and the underlying native platform window system, such as X.

Streaming input and output

  • OpenMAX: interface for multimedia codecs, platforms and hardware
  • StreamInput: interface for sensors
  • OpenVX: OpenCV-alternative, built for performance.
  • OpenKCam: interface for cameras and sensors

Others

One video called “OpenRoad” to show them all:

http://www.youtube.com/watch?v=ckD0op6OgMQ

Want to learn more? Feel free to ask in the comments, or check out https://www.khronos.org/

What is OpenCL?

OpenCL (trademark of Apple Inc.) is an open, royalty-free industry standard that makes much faster computations possible. The standard is controlled by the non-profit standards organisation Khronos. By using this technique with graphics cards (GPUs) or the vector extensions of modern processors, you can for example convert a video in 20 minutes instead of 2 hours.

Programming the GPU used to be a very difficult task, done by specialised teams and universities, but since 2010 it has been within reach of more companies.

Below is a video which explains the differences between single-core, multiple core (starting at 1:27) and OpenCL (starting at 2:32).

http://www.youtube.com/watch?v=IEWGTpsFtt8

You can read more about the engineering ins and outs of the standard at http://www.khronos.org/opencl/.

How OpenCL works

OpenCL is an extension to existing languages. It makes it possible to specify a piece of code that is executed multiple times independently from each other. This code can run on various processors – not only the main one. Also there is an extension for vectors (float2, short4, int8, long16, etc), because modern processors have support for that.

Say, for example, you need to calculate sin(x) for a large array of one million numbers. OpenCL detects which devices could do this computation for you and gives some statistics about each device. You pick the best device, or even several devices, and send the data to the device(s). Normally you would loop over the million numbers, but now you say something like: “Get me sin(x) of each x in array A”. When finished, you take the data back from the device(s) and you are done.

As the compute-devices can do more in parallel and OpenCL is better in describing independent functions, the total execution time is much lower than conventional methods.
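The difference between the two styles can be mimicked in plain Python. The sketch below runs sequentially on the CPU, of course; the point is the shape of the code: the map form states only the per-element operation, which is exactly what an OpenCL kernel launch expresses, with one independent work-item per element:

```python
import math

A = [0.1 * i for i in range(1_000_000)]   # one million numbers

# Loop version: "for each x, compute sin(x), one after another".
result_loop = []
for x in A:
    result_loop.append(math.sin(x))

# Map version: "get me sin(x) of each x in A" - no ordering implied.
# This is the shape of an OpenCL kernel launch: each element is an
# independent work-item, so a device is free to compute them in parallel.
result_map = list(map(math.sin, A))

assert result_loop == result_map
```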

5 questions on OpenCL

Q: Why is it so fast?
A: Because many hands make light work – the hundreds of little processors on a graphics card being the extra hands. But cooperation with the main processor remains important to achieve maximum output.

Q: Does it work on any type of hardware?
A: As it is an open standard, it can work on any type of hardware that targets parallel execution. This can be a CPU, GPU, DSP or FPGA.

Q: How does it compare to OpenMP/MPI?
A: Where OpenMP and MPI try to split loops over threads/servers and are CPU-oriented, OpenCL focuses on making threads data-position aware and on exploiting processor capabilities. There are several efforts to combine the two worlds.

Q: Does it replace C or C++?
A: No, it is an extension which integrates well with C, C++, Python, Java and more.

Q: How stable/mature is OpenCL?
A: The current version is 1.2 and the standard is three years old. OpenCL builds on many predecessors, however, so the underlying ideas are considerably older than three years.

The magic of clGetKernelWorkGroupInfo

Workgroup with unknown characteristics

It's not easy to get the available private memory size – in fact it's impossible to get this information directly from the device or drivers via the OpenCL API. It only becomes clear once you dive deep into clGetKernelWorkGroupInfo – the function that tells you how well your kernel fits on the device. Strangely, this function is rarely discussed.
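All the queries discussed below follow the same calling pattern. A hedged sketch of two of them, assuming a `kernel` and `device` have already been created:

```c
size_t wg_size;     /* maximum work-group size for this kernel */
cl_ulong local_mem; /* local memory used by this kernel, in bytes */

clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(wg_size), &wg_size, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(local_mem), &local_mem, NULL);
```

Note that the answers are per kernel *and* per device: the same kernel can fit very differently on two devices.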

Memory sizes

CL_KERNEL_LOCAL_MEM_SIZE

Returns the amount of local memory, in bytes, being used by a kernel (per work-group). Use CL_DEVICE_LOCAL_MEM_SIZE to find out the maximum.

CL_KERNEL_PRIVATE_MEM_SIZE

Returns the minimum amount of private memory, in bytes, used by each work-item in the kernel.

Work sizes

CL_KERNEL_GLOBAL_WORK_SIZE

This answers the question "What is the maximum value for the global_work_size argument that can be given to clEnqueueNDRangeKernel?". The result is of type size_t[3]. Note that this query is only valid for built-in kernels or custom devices; for regular kernels it returns an error.

CL_KERNEL_WORK_GROUP_SIZE

This is the same, but for local_work_size. The kernel's resource requirements (register usage, etc.) are taken into account when determining this maximum work-group size.

CL_KERNEL_COMPILE_WORK_GROUP_SIZE

If __attribute__((reqd_work_group_size(X, Y, Z))) is used, then (X, Y, Z) is returned, else (0, 0, 0).
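For reference, this is how the attribute looks in kernel source (the kernel name and the sizes are illustrative):

```c
// Forces the work-group size to (64, 1, 1); the compiler can optimise
// for it, and CL_KERNEL_COMPILE_WORK_GROUP_SIZE will return (64, 1, 1).
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void fixed_size_kernel(__global float *data)
{
    data[get_global_id(0)] *= 2.0f;
}
```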

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

It returns a performance hint: if the work-group size is a multiple of this number, you'll get good results. So no more hard-coding 32 or 64 for specific GPUs – simply call this function.

Combined with clGetDeviceInfo's CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, you can fine-tune your work-group size in case you need the group size to be as large as possible.
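In practice you round the global work size up to a multiple of the hinted value and let out-of-range work-items return early in the kernel. A small helper sketch in plain C (the values in the usage note use 32 purely as an illustrative multiple):

```c
#include <stddef.h>

/* Round a global work size up to a multiple of the value returned by
   the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE query. */
size_t round_up_global_size(size_t global_size, size_t multiple)
{
    size_t remainder = global_size % multiple;
    if (remainder == 0)
        return global_size;
    return global_size + (multiple - remainder);
}
```

For example, a global size of 1000 with a preferred multiple of 32 becomes 1024; the 24 extra work-items simply do nothing after a bounds check in the kernel.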

Read more?

You'll find interesting usages when searching specifically for these flags on GitHub or Stack Overflow.

A short list of interesting Stack Overflow discussions:

GPUs and Gartner’s Top 10 Strategic Technology Trends For 2017

What does 2017 bring in technology? At the start of each year, Gartner shares its vision to give insight into which technologies to invest in. Looking through this year's list, the most important enabling technologies are the GPU and the Internet of Things (IoT) – see the image below. Whereas the last four are IoT-based, the first four would not have been possible without GPUs.

The middle two are more mature technologies, built on many years of progress – and it happens that the GPU has played a big role in getting there. And of course GPUs and IoT are not the only reasons these ten are on this year's list.

Continue reading “GPUs and Gartner’s Top 10 Strategic Technology Trends For 2017”

Porting CUDA to OpenCL

OpenCL speed-meter in 1970 Plymouth Cuda car

Why port your CUDA-accelerated software to OpenCL? Simply to make your software also run on AMD CPUs/APUs/GPUs, Intel CPUs/GPUs, Altera FPGAs, Xilinx FPGAs, Imagination PowerVR, ARM Mali, Qualcomm Snapdragon and upcoming architectures.

And as OpenCL is an open standard supported by many vendors, there is far more certainty that it will continue to exist than with any proprietary language.

If you look at the history of GPU programming you'll find many frameworks, such as BrookGPU, Close-to-Metal, Brook+, RapidMind, CUDA and OpenCL. CUDA was the best choice from 2008 to 2013, while OpenCL had to catch up. Now that OpenCL is gaining serious market traction, the demand for porting legacy CUDA code to OpenCL is rising – as we clearly notice here.

We are very experienced in porting legacy CUDA code to all flavours of OpenCL (CPU, GPU, FPGA, embedded). Of course, porting from OpenCL to CUDA is also possible, as is updating legacy CUDA code to the standards of CUDA 7.0 and later. We can also improve the architecture along the way; we have made many customers happy by delivering more structured and better-documented code while working on the port. Want to see some work we did? We ported the molecular dynamics software Gromacs from CUDA to OpenCL.

Contact us today, to discuss your project in more detail. We have porting services for each budget.

[button text=”Request a pilot, code-review or more information” url=”https://streamhpc.com/consultancy/request-more-information/” color=”orange” target=”_blank”]

Join us at the Dutch eScience Symposium 2019 in Amsterdam

Soon the Dutch eScience Symposium 2019 will take place in Amsterdam. We thought it might be a good place to meet and listen to e-science talks. Stream HPC, after all, is in the business of making scientific software, so we're in the right place. The eScience Center is a government institute that aims to advance eScience in the Netherlands.

Interested? Read on!

Continue reading “Join us at the Dutch eScience Symposium 2019 in Amsterdam”