Will OpenCL work for me?

OpenCL can accelerate your software by multiple factors, but only if the data and the software are a good fit.

The same applies to CUDA and other GPGPU-methods.

Find out in 4 steps whether you can speed up your software with OpenCL.
[columns]
[one_half title=”1. Lots of repetitions”]
To find code that can run in parallel, look for loops that take a relatively large share of the run time. If an action needs to be done for each part of the input data, then the code certainly contains a lot of loops, and you can go to the next step.
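If no profiler is at hand, a rough way to see which loops dominate is to time them directly. The sketch below is only an illustration: the helper name `seconds_spent` is hypothetical, and it measures wall-clock time with the standard `std::chrono::steady_clock`.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in seconds.
template <typename Work>
double seconds_spent(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Wrap each candidate loop in a lambda, pass it to `seconds_spent`, and compare the results with the total run time. Loops that take up a large share are the ones worth studying in the next steps.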

If data goes through the code from A to B in a straight line without many loops, then there is a very low chance that computing speed is the bottleneck. Look first into a faster network, better caching, faster memory and the like.
[/one_half]
[one_half title=”2. No or few dependencies”]
If the iterations of a loop do not depend on the previous iteration, then you can go to the next step.

Because dependencies between iterations do not matter for single-core software, reducing them was not an important focus for developers even five years ago. Many new algorithms now exist that decrease loop-internal dependencies. If your software has already been optimised for several processors or even a cluster, then the step to OpenCL is much smaller.

For example, search problems can be sped up by dividing the data between many threads. Even though the dependencies within one thread are high, the dependency on the other threads is very low.
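To make the difference concrete, here is a minimal sketch of the two kinds of loops. Both function names are made up for illustration. The first loop has no dependency between iterations, the second carries a running value from one iteration to the next.

```cpp
#include <cstddef>
#include <vector>

// Independent iterations: out[i] depends only on in[i], so every
// iteration could be handed to its own OpenCL work-item.
std::vector<float> scale_all(const std::vector<float>& in, float factor) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * factor;
    }
    return out;
}

// Loop-carried dependency: out[i] needs the total of all earlier
// elements, so written like this the iterations must run in order.
std::vector<float> running_total(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    float total = 0.0f;
    for (std::size_t i = 0; i < in.size(); ++i) {
        total += in[i];
        out[i] = total;
    }
    return out;
}
```

A loop like `running_total` is not a dead end, but it needs a different algorithm (a parallel prefix sum, for instance) before it fits a GPU.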
[/one_half]

[/columns]

[columns]

[one_half title=”3. High predictability to avoid branching”]

Computations need to be as predictable as possible to get the highest speed-up. That means the code within the loops needs to have no or few branches, that is, few statements like if, while or switch. This is because GPUs work best when the whole processor does the same thing. So if you now have many threads which all do different things, then a CPU is still the best solution. As with decreasing the dependencies in step 2, in many cases redesigning the algorithm can result in well-performing GPU code.
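As a small illustration of such a redesign, the sketch below limits values to a range in two ways. The function names are hypothetical. The second version expresses the same result with `std::min` and `std::max` instead of explicit control flow. OpenCL C offers comparable built-ins (`min`, `max`, `clamp`, `select`) for this purpose.

```cpp
#include <algorithm>
#include <vector>

// Branching version: neighbouring elements may take different paths.
void limit_branch(std::vector<float>& data, float lo, float hi) {
    for (float& x : data) {
        if (x < lo) {
            x = lo;
        } else if (x > hi) {
            x = hi;
        }
    }
}

// Branch-free formulation: every element runs the same instructions.
void limit_select(std::vector<float>& data, float lo, float hi) {
    for (float& x : data) {
        x = std::min(std::max(x, lo), hi);
    }
}
```

Whether a compiler already turns the first version into the second depends on the compiler and the target, so measure before and after.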

[/one_half]

[one_half title=”4. Low Data-transport overhead”]

In step 1 you looked for repeated computations. In this last step we look at the ratio between computations and data-size.

If the number of computations per data chunk is high, then using the GPU is a good solution. The reason is that data needs to be transferred to and from the GPU, which takes time even with a throughput of 3 to 6 GB per second. A simple way to find out whether a lot of computations are done is to look at the CPU usage in the system monitor.
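A back-of-the-envelope check of this ratio can be sketched as follows. The function names and the simple model (transfer in, compute, transfer out, with no overlap) are assumptions for illustration only. Real transfers can often be overlapped with computation.

```cpp
// Seconds needed to move `bytes` once at `gb_per_s` gigabytes per second.
double transfer_seconds(double bytes, double gb_per_s) {
    return bytes / (gb_per_s * 1e9);
}

// Simplified model: offloading pays off only if copying the input,
// computing on the GPU and copying the result back together take
// less time than computing on the CPU.
bool offload_pays_off(double bytes_in, double bytes_out, double gb_per_s,
                      double cpu_seconds, double gpu_seconds) {
    const double gpu_total = transfer_seconds(bytes_in, gb_per_s)
                           + gpu_seconds
                           + transfer_seconds(bytes_out, gb_per_s);
    return gpu_total < cpu_seconds;
}
```

With 3 GB going in and 3 GB coming out at 3 GB per second, the transfers alone cost 2 seconds. A job that takes 1.5 seconds on the CPU cannot win in this model, however fast the GPU computes.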

When the number of computations per data chunk is low, doubling the speed is still possible when OpenCL is used on CPUs. See the technical explanation of how OpenCL works on modern CPUs and can even outperform a GPU.

[/one_half]
[/columns]


Does it fit?

Found out OpenCL is right for you? Contact us immediately and we can discuss how we can make your software faster. Not sure? Request a code-review or Rapid OpenCL Assessment to quickly find out if it works.

Do you think OpenCL is not the solution, but are you still processing data at the limits of your system? Feel free to contact us, as we can give you free feedback on how to solve your problem with other techniques.

More to read on our blog

OpenCL is supported on many CPUs and GPUs. See this blog article for an extensive overview of hardware that supports OpenCL.

A list of application areas where OpenCL can be used is written down here.

Finally, there is also a series on parallel programming theories, which explains certain theories behind OpenCL.

Keep The Hardware Focus

The real Apu

If you buy a car, the kind of fuel is not often the first choice. You first select on the engine properties, the looks, the interior, the brand and for sure the total cost of ownership. The costs can be a reason to choose a certain type of fuel though. In the parallel computation world it is different. There the fuel (CUDA or OpenCL) is the first decision and then the hardware is chosen. I think this is wrong and therefore speak a lot about CUDA-vs-OpenCL, even though I think NVidia is a good choice for a whole list of algorithms.

When we give advice during a consultation, we want to give the best advice. In the case of CUDA, that would be to go for a Tesla or the latest GTX, depending on budget; in the case of OpenCL we can give much better advice on hardware. But actually starting with the technique is the worst thing you can do: focus on the hardware and then pick the technique that suits it best.

IMPORTANT. The following is for understanding some concepts and limits only! It is purely theoretical, so I don’t claim any real-world results. Also not taken into account is how well different processors handle control instructions (for, while, if, case, etc.), which has quite some influence on actual performance.

Continue reading “Keep The Hardware Focus”

Altera published their OpenCL-on-FPGA optimization guide

Altera has just released their optimisation guide for OpenCL-on-FPGAs. It does not go into the how-tos of OpenCL, but assumes you have knowledge of the technology. Neither does it provide any information on the basics of Altera’s Stratix V or other FPGAs.

It is the first public optimisation document, so sending feedback directly is appreciated. Not aware of what OpenCL can do on an FPGA? Watch the video below.

https://www.youtube.com/watch?v=p25CVFMc-dk

Subjects

The following subjects and optimisation tricks are discussed:

  • FPGA Overview
  • Pipelines
  • Good Design Practices
  • Avoid Pointer Aliasing
  • Avoid Expensive Functions
  • Avoid Work-Item ID-Dependent Backward Branching
  • Aligned Memory Allocation
  • Ensure 4-Byte Alignment for All Data Structures
  • Maintain Similar Structures for Vector Type Elements
  • Optimization of Data Processing Efficiency
  • Specify a Maximum Work-Group Size or a Required Work-Group Size
  • Loop Unrolling
  • Resource Sharing
  • Kernel Vectorization
  • Multiple Compute Units
  • Combination of Compute Unit Replication and Kernel SIMD Vectorization
  • Resource-Driven Optimization
  • Floating-Point Operations
  • Optimization of Memory Access Efficiency
  • General Guidelines on Optimizing Memory Accesses
  • Optimize Global Memory Accesses
  • Perform Kernel Computations Using Constant, Local or Private Memory
  • Single Work-Item Execution

Carefully compare these with CPU and GPU optimisation guides to be able to write more generic OpenCL code.

Download

You can download the document here.

If you have any questions on OpenCL-on-FPGAs, OpenCL, generic optimisations or Altera FPGAs, feel welcome to contact us.

Software Development

You have developed software that gives the answers you need but takes too long? Or maybe you need to calculate large data-sets on an hourly basis, while the batch takes 2 hours?

What do you do when faster hardware starts to get too costly in terms of maintenance costs? You can buy specialized hardware, but that increases costs and dependence on external knowledge. Or, you can choose to just wait for the results to come in, but you can only do this when the computation is not a core process.

What if you could use off-the-shelf hardware to decrease waiting-time? By using OpenCL-devices, which can be high-end graphics cards or other modern processors, software can be sped up by a factor of 2 to 20. Why? Because these devices can do much more in parallel and OpenCL makes it possible to make use of that (unused) potential. A few years ago this was not possible in the way it is done now; that’s probably the main reason you haven’t heard of it.

Solutions

Everything we offer falls into three solutions: find what is available, make a parallel version of the code, and hand-tune the code for maximum performance.

[pricing_tables]
[pricing_table column=”one_third” title=”Specialised Libraries” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • For many “good enough”
  • Faster code, the easy way.
  • Gives high performance for generic problems.

[/pricing_table]
[pricing_table column=”one_third” title=”Parallel Coding” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • Better caching can give more boost than using faster hardware
  • Software running in parallel is a first step to GPU-computing
  • Making the software modular when possible

[/pricing_table]
[pricing_table column=”one_third” title=”High Performance Coding” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • The highest performance is guaranteed
  • Optimized for the targeted hardware

[/pricing_table]
[/pricing_tables]

Services

There are so many possibilities to speed up code, but one is the best. To help you find the right path, we offer various services.

[pricing_tables]
[pricing_table column=”one_third” title=”Code Review” buttontext=”More info »” buttonurl=”https://streamhpc.com/consultancy/our-services/code-review/” buttoncolor=””]

  • Code-review of GPU-code (OpenCL, CUDA, Aparapi, and more).
  • Code-review of CPU-code (Java, C, C++ and more).
  • Report within 1 week if necessary.

[/pricing_table]
[pricing_table column=”one_third” title=”GPU Assessment” buttontext=”More info »” buttonurl=”https://streamhpc.com/consultancy/rapid-opencl-assessment/” buttoncolor=””]

  • Find parallelizable computations
  • Assess the fitness to run on GPUs
  • Report within 2 weeks

[/pricing_table]
[pricing_table column=”one_third” title=”Architecture Assessment” buttontext=”Request more info »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • Architecture check-up
  • Data-transport measurements
  • Report within 2 weeks

[/pricing_table]
[/pricing_tables]

More information

We can make your compute-intensive algorithms much faster and more scalable. How do we do it? We can explain it all to you by phone or in person. Send in the form on this page, and we will contact you.

You can also call now: +31 6454 00 456.

We invite you to download our brochures to get an overview of how we can help you widen the bottlenecks in your software.

Using Qt Creator for OpenCL

More and more ways are becoming available to bring easy OpenCL to you. Most of the convenience libraries are wrappers for other languages, so it seems that C and C++ programmers have the hardest time. For a while now my favourite way to go has been Qt: it is multi-platform, has a good IDE, is very extensive, has good multi-core and OpenGL-support and… has an extension for OpenCL: http://labs.trolltech.com/blogs/2010/04/07/using-opencl-with-qt http://blog.qt.digia.com/blog/2010/04/07/using-opencl-with-qt/

Other multi-platform choices are Anjuta, CodeLite, Netbeans and Eclipse. I will discuss them later, but wanted to give Qt an advantage because it also simplifies your OpenCL-development. While it is great for learning OpenCL-concepts, please know that the commercial version of Qt Creator costs at least €2995,- a year. I must also warn that the plugin is still in beta.

streamhpc.com is not affiliated with Qt.

Getting it all

Qt Creator is available in most Linux-repositories: install packages ‘qtcreator’ and ‘qt4-qmake’. For Windows, Mac and the other Linux-distributions there are installers available: http://qt.nokia.com/downloads. People who are not familiar with Qt really should take a look around on http://qt.nokia.com/.

You can get the source for the plugin QtOpenCL by using Git:

git clone http://git.gitorious.org/qt-labs/opencl.git QtOpenCL

See http://qt.gitorious.org/qt-labs/opencl for more information about the status of the project.

Alternatively, you can download it here: https://dl.dropbox.com/u/1118267/QtOpenCL_20110117.zip (version 17 January 2011)

Building the plugin

For Linux and Mac you need to have the ‘build-essentials’. For Windows it might be a lot harder, since you need make, gcc and a lot of other build-tools which are not easily packaged for Windows. If you’ve made a win32-binary and/or a Windows-specific how-to, let me know.

You might have seen that people have problems building the plugin. The trick is to use the options -qmake and -I (capital i) with the configure-script:

./configure -qmake <location of qmake 4.6 or higher> -I<location of directory CL with OpenCL-headers>

make

Notice the spaces. The program qmake is provided by Qt (package ‘qt4-qmake’), the OpenCL-headers by the SDK of ATI or NVidia (you’ll need the SDK anyway), or by Khronos. For example, on my laptop (NVIDIA, Ubuntu 32bit, with Qt 4.7):

./configure -qmake /usr/bin/qmake-qt4 -I/opt/NVIDIA_GPU_Computing_SDK_3.2/OpenCL/common/inc/

make

This should work. On Mac the directory is not CL, but OpenCL – I haven’t tested whether Qt took that into account.

After building, test it by setting the environment variable “LD_LIBRARY_PATH” to the lib-directory in the plugin, and run the provided example-app ‘clinfo’. For example, on Linux:

export LD_LIBRARY_PATH=`pwd`/lib:$LD_LIBRARY_PATH

cd util/clinfo/

./clinfo

This should give you information about your OpenCL-setup. If you need further help, please go to the Qt forums.

Configuring Qt Creator

Now it’s time to make a new project with support for OpenCL. This has to be done in two steps.

First make a project and edit the .pro-file by adding the following:

LIBS     += -L<location of opencl-plugin>/lib -L<location of OpenCL-SDK libraries> -lOpenCL -lQtOpenCL

INCLUDEPATH += <location of opencl-plugin>/lib/ \
               <location of OpenCL-SDK include-files> \
               <location of opencl-plugin>/src/opencl/

For example:

LIBS     += -L/opt/qt-opencl/lib -L/usr/local/cuda/lib -lOpenCL -lQtOpenCL

INCLUDEPATH += /opt/qt-opencl/lib/ \
               /usr/local/cuda/include/ \
               /opt/qt-opencl/src/opencl/

The following screenshot shows how it could look:

Second we edit (or add) the LD_LIBRARY_PATH in the project-settings (click on ‘Projects’ as seen in screenshot):

/usr/lib/qtcreator:<location of opencl-plugin>/lib:<location of OpenCL-SDK libraries>:

For example:

/usr/lib/qtcreator:/opt/qt-opencl/lib:/usr/local/cuda/lib:

As you see, we now also need to have the Qt-creator-libraries and SDK-libraries included.

The following screenshot shows the edit-field for the project-environment:

Testing your setup

Just add something from the clinfo-source to your project:

printf("OpenCL Platforms:\n");
QList<QCLPlatform> platforms = QCLPlatform::platforms();
foreach (QCLPlatform platform, platforms) {
    // General information about this platform.
    printf("    Platform ID       : %ld\n", long(platform.platformId()));
    printf("    Profile           : %s\n", platform.profile().toLatin1().constData());
    printf("    Version           : %s\n", platform.version().toLatin1().constData());
    printf("    Name              : %s\n", platform.name().toLatin1().constData());
    printf("    Vendor            : %s\n", platform.vendor().toLatin1().constData());
    printf("    Extension Suffix  : %s\n", platform.extensionSuffix().toLatin1().constData());
    // List the extensions of this platform, one per line.
    printf("    Extensions        :\n");
    QStringList extns = platform.extensions();
    foreach (QString ext, extns)
        printf("        %s\n", ext.toLatin1().constData());
    printf("\n");
}

If it gives errors during programming (underlined includes, etc.), focus on INCLUDEPATH in the project-file. If it complains when building the application, focus on LIBS. If it complains when running the successfully built application, focus on LD_LIBRARY_PATH.

Ok, it is maybe not that easy to get it running, but I promise it gets easier after this. Check out our Hello World, the provided examples and http://doc.qt.nokia.com/opencl-snapshot/ to start building.

Updated: OpenCL and CUDA programming training – now online

Update: due to Corona, the Amsterdam training has been cancelled. We’ll offer the training online on dates that better suit the participants.

As it has been very busy here, we have not done public trainings for a long time. This year we’re going to train future GPU-developers again – online. For now it’s one date, but we’ll add more dates in this blog-post later on.

If you need to learn solid GPU programming, this is the training you should attend. The concepts can be applied to other GPU-languages too, which makes it a good investment for any probable future where GPUs exist.

This is a public training, which means there are attendees from various companies. If you prefer not to be in a public class, get in contact to learn more about our in-company trainings.

It includes:

  • Four days of training online
  • Free code-review after the training, to get feedback on what you created with the new knowledge;
  • 1 month of limited support, so you can avoid StackOverflow;
  • Certificate.

Trainings will be done by employees of Stream HPC, who all have a lot of experience with applying the techniques you are going to learn.

Schedule

Most trainings have around 40% lectures, 50% lab-sessions and 10% discussions.

Continue reading “Updated: OpenCL and CUDA programming training – now online”

Kernels and the GPL. Are we safe and linking?

Disclaimer: I am not a lawyer and below is my humble opinion only. The post is for insights only, not for legal matters.

The GPL has always been a protection against somebody or some company running away with your code and making money with it, or at least a way to force improvements back into the community. For unprepared companies this was quite some stress when they were forced to give their software away. Now we have host-kernel languages such as OpenCL, CUDA, DirectCompute and RenderScript, which don’t really link a kernel, but load it and launch it. As the GPL is quite complicated when it comes to mixing with commercial code, I want to give a warning that the GPL might not be prepared for this.

If your software is dual-licensed, you cannot assume the GPL is not the licence chosen when it eventually gets used in commercial software. Read below why not.

I hope we can have a discussion here, so we get to the bottom of this.

Continue reading “Kernels and the GPL. Are we safe and linking?”

WebCL Widget for WordPress

See the widget at the right, showing whether your browser+computer supports WebCL?

It is available under the GPL 2.0 license and based on code from WebCL@NokiaResearch (thanks guys for your great Firefox-plugin!)

Download from WordPress.org and unzip in /wp-content/plugins/. Or (better), search for a new plugin: “WebCL”. Feedback can be given in the comments.

I’d like to get your feedback on which features you would like to see in the next version.

Continue reading “WebCL Widget for WordPress”

OpenCL SPIR by example

OpenCL SPIR (Standard Portable Intermediate Representation) is an intermediate representation for OpenCL-code, comparable to LLVM IR and HSAIL. It is a search for what would be a good representation, such that parallel software runs well on all kinds of accelerators. LLVM IR is too general, but SPIR is a subset of it. I’ll discuss HSAIL later, focusing on where it differs from SPIR; I thought SPIR was a better way to start introducing these. In my next article I’d like to give you an overview of the whole ecosphere around OpenCL (including SPIR and HSAIL), to give you an understanding of what it all means, where we’re going, and why.

Know that the new SPIR-V is something completely different in implementation, and we are only discussing the old SPIR here.

Contributors for the SPIR specifications are: Intel, AMD, Altera, ARM, Apple, Broadcom, Codeplay, Nvidia, Qualcomm and Xilinx. Boaz Ouriel of Intel is the pen-holder of the specifications and to no surprise Intel has had the first SPIR-compiler. I am happy to see Nvidia is in the committee too, and hope they don’t just take ideas for CUDA from this collaboration but finally join. Broadcom and Xilinx are new, so we can expect stuff from them.

 

For now, just see what SPIR is, as it can help us understand how the compiler works and write better OpenCL code. I used Intel’s offline OpenCL compiler; compiling the kernel below to SPIR can be done on the command line with: ioc64 -cmd=build -input=sum.cl -llvm-spir32=sum.ll (you need an Intel CPU to use the compiler).

[raw]

__kernel void sum(const int size, __global float * vec1, __global float * vec2){
  int ii = get_global_id(0);

  if(ii < size) vec2[ii] += vec1[ii];

}

[/raw]
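To make explicit what the kernel above computes, here is a plain C++ emulation of it. This is only a sketch: the function name `sum_reference` and the `global_size` parameter are made up for illustration. The loop counter stands in for `get_global_id(0)`, and the guard mirrors the `ii < size` check, which matters when the global work size is rounded up beyond `size`.

```cpp
#include <cstddef>
#include <vector>

// Sequential emulation of the `sum` kernel: each loop iteration does
// what one work-item would do. `global_size` may be larger than `size`.
void sum_reference(std::size_t size, const std::vector<float>& vec1,
                   std::vector<float>& vec2, std::size_t global_size) {
    for (std::size_t ii = 0; ii < global_size; ++ii) {  // ii ~ get_global_id(0)
        if (ii < size) {
            vec2[ii] += vec1[ii];
        }
    }
}
```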

There are two variations for generating SPIR-code: binary SPIR and LLVM-SPIR (both in 32 and 64 bit versions). As you might understand, the binary form is not really readable, but SPIR described in the LLVM IR language luckily is. Run ioc64 without parameters to see more options (Assembly, pure LLVM, Intermediate Binary).

Interest in OpenCL

I have had this blog for more than a year, and I want to show where in the world its visitors come from. Why? Then you know where OpenCL is popular and where not. I chose an undisclosed period, so you cannot really reverse engineer how many visitors I have – but the nice thing is that not much changes between a few days and a month. Unluckily Google Analytics is not really great for maps (Greenland as big as Africa, hard to compare US states to EU countries, cities disappear at world-views, etc.), so I needed to do some quick image-editing to make it somewhat clearer.

In the world-view you see that the most interest comes from 3 regions: Europe, North America and South-East Asia. Africa is the really absent continent here: except for some Arab countries and South Africa, there are only some sporadic visits from the other countries. What surprises me is that the Arab countries are among my frequent visitors – this could be a language-issue, but I expected about the same number of visitors as from e.g. China. In Latin America the interest comes mostly from Brazil.

Continue reading “Interest in OpenCL”

The Fastest Payroll System Of The World

At StreamHPC we do several very different types of projects, but this project has been very, very different. In the first place, it was nowhere close to scientific simulation or media processing. Our client, Intersoft solutions, asked us to speed up thousands of payroll calculations on a GPU.

They wanted to solve a simple problem, avoiding slow conversations with HR of large companies:

Yes, I can answer your questions.

For that I need to do a test-run.

Please come back tomorrow.

The calculation of 1600 payslips took one hour. This means 10,000 employees would take over 6 hours. Potential customers appreciated the clear advantages of Intersoft’s solution, but told them that they were searching for a faster solution in the first place.

Using our accelerated compute engine, a run with 3300 employees (anonymised, real data) now only takes 20 seconds, including loading and writing all data to the database – a speedup of about 250 times. All calculations for 100k employees can be done in under 2 minutes – the above HR department would have liked that.

Continue reading “The Fastest Payroll System Of The World”

28 June: OpenCL course in Utrecht, NL

On 28 June 2011 StreamComputing will give a 1-day course on OpenCL in Utrecht. As it is quite new, the price is reduced. If you want to learn CUDA or any other GPGPU-language, this course is also a good option for you. The most important thing about GPGPU are the concepts, in other words the “why” behind the design of GPGPU-languages. In my course you will get it after a one-day training. Most of the day consists of lectures, with short lab-sessions. The training makes use of a unique block-method, so you learn the technique top-down and can almost fill in the spaces yourself. At least 2 years of thorough programming-experience in Java, C++ or Objective C is preferred, because of the level of the subjects. The following is discussed, with the big why-question leading:

[list1]

  • OpenCL debunked: getting to understand how OpenCL is engineered.
  • Algorithms: which can be sped-up with GPGPU/OpenCL and which not.
  • Architectures & Optimisations: why does one OpenCL-program work better on one architecture and not on another.
  • Software-engineering: wrapper-languages, code re-use and integration in existing software.
  • Debugging: not the screenshots, but giving you insight in how the memory-models work.
[/list1]

The lab-sessions are very minimal; you get (fully documented) homework which you can do in the subsequent week (with assistance via mail). If you prefer to have extensive lab-sessions, please inquire about the possibilities. After the session and the homework you’ll be able to decide on your own what kind of software can be sped up by using OpenCL and which cannot. Also, you will be able to integrate OpenCL into your own software and engineer OpenCL-kernels. Note that the advances you make depend heavily on your seniority in programming. If all attendees are Dutch, the course is given in Dutch. Future sessions will be in other cities, so if you prefer to receive training more locally or at your company, please ask about the possibilities.

If you want more information, contact us.

ERSA-NVIDIA award for “Best Young Entrepreneur”

StreamHPC supports the ERSA conference, 22-25 July in Las Vegas. At that conference an award will be given to the “Best Young Entrepreneur” and I’d like you to send in a proposal. The winner gets an NVIDIA Tesla K20!

Young entrepreneurs and academics with a great product/project are invited to present their solution. As the event draws around 2000 people, you get the attention needed to show-case your new company or research-group. Your solution does not need to be based on FPGAs or GPUs, as long as Von Neumann’s architecture is not in it.

Read the information below or directly go to the ERSA-NVIDIA awards-homepage.
“Von Neumann’s architecture lasted for 75 years. That genius can no longer lead us into the new age of computing that is upon us. This competition seeks to acknowledge those pioneers that are helping to build the new computing landscape.”

Submission of Proposals for ERSA-NVIDIA award Candidates
Deadline: 31 May 2013 (extended from 6 May 2013)
Send proposals to org@ersaconf.org

The award is intended for entrepreneurs developing tools, advanced technologies and opportunities for supporting applications, both academic and commercial, across the broad area of high-performance embedded systems implemented as multicore systems and reconfigurable heterogeneous parallel processing systems.

The Award Committee includes:

Leading Universities

  • Stanford University, USA, Prof. Michael Flynn
  • Imperial College London, UK, Prof. Wayne Luk
  • Karlsruhe Institute of Technology, Germany, Prof. Joerg Henkel
  • Keio University, Japan, Prof. Hideharu Amano
  • Shanghai Jiao Tong University, China, Prof. Simon See

Leading Companies (tentative list):

  • NVIDIA, Can Ozdoruk, Product Manager
  • Altera, Steve Casselman, Principal Engineer
  • National Instruments, Hugo Andrade, Principal Architect

For more info go to: http://ersaconf.org/awards/

If you have any questions, just ask them in the comments or send us an email.

Valgrind suppression file for AMD64 on Linux

Valgrind is a great tool for finding possible memory leaks in code written in C, C++, Java, Perl, Python, assembly code, Fortran, Ada, etc. I use it to check whether the provided code is OK before I start porting it to GPU-code. It finds those devils in the details. It has also given me good feedback when finding my own bugs while writing OpenCL-code. Unfortunately it does not work well with optimised libraries, such as the OpenCL-driver from AMD.

You’ll get problems like the ones below, which clutter the output.

==21436== Conditional jump or move depends on uninitialised value(s)
==21436==    at 0x6993DF2: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x6C00F92: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x6BF76E5: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x6C048EA: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x6BED941: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x69550D3: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x69A6AA2: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x69A6AEE: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x69A9D07: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x68C5A53: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x68C8D41: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x68C8FB5: ??? (in /usr/lib/fglrx/libamdocl64.so)

How to fix this cluttering? Continue reading “Valgrind suppression file for AMD64 on Linux”

SDKs

!!!THESE PAGES WILL BE MOVED TO OPENCL.ORG!!!

OpenCL is growing fast and various architectures now support compute-acceleration. This means that you have a lot of choice to find the right solution for your algorithm.

Working

Possibly in the (near) future

Currently we are looking into:

  • Game Consoles
    • Nintendo Wii U dev – only vague rumours.
    • Sony Playstation 4 Orbis – strong rumours.
  • Movidius – has internal builds, but will only release on customer’s request.
  • Texas Instruments – support on C66x multicore DSPs (PDF source) and on their ARM-chips.
  • ST-Ericsson
 If you have more information, let us know.

Abandoned

 

Useful peripherals

When working with various devices, you might find the below tips useful.

ARM

When working with those small cute computers, three things come in handy:

  • An HDMI-switch (or a monitor with more HDMI-inputs).
  • A small keyboard+mouse which uses Bluetooth or only one USB-port. I use the Logitech-keyboard as shown at the right.
  • A network-switch with enough free ports. Even though most boards have Wi-Fi, good internet proves itself to be valuable.

Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup

Example of modelled versus measured water activity (‘effective’ concentration) for highly detailed organic chemical representation based on continental studies using UNIFAC

As we cannot use the performance results for most of our commercial projects because they contain sensitive data, we were happy that Dr. David Topping from the University of Manchester was so kind to allow us to share the data for the UNIFAC project. The goal for this project was simple: port the UNIFAC algorithm to the Intel XeonPhi using OpenCL. We got a total of 485x speedup: 3.0x for going from single-core to multi-core CPU, 53.9x for implementing algorithmic, low-level improvements and a new memory layout design, and 3.0x for using the XeonPhi via OpenCL. To remain fair, we used the 160x speedup from multi-core CPU in the title, not from serial code. Continue reading “Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup”

The CPU is dead. Long live the CPU!

Scene from Gladiator in which the end of somebody’s life is decided.

Look at the computers and laptops sold at your local computer shop. Only a few systems have a separate GPU, either as a PCI-device or integrated on the motherboard. The graphics are handled by the CPU now. The Central Processing Unit as we knew it is dying.

To be clear I will refer to an old CPU as “GPU-less CPU”, and name the new CPU (with GPU included) as plain “CPU” or “hybrid Processor”. There are many names for the new CPU with all their own history, which I will discuss in this article.

The focus is on X86. The follow-up article is on whether the king X86 will be replaced by king ARM.

Know that all of this is based on my own observations; please comment if you have additional information.

Continue reading “The CPU is dead. Long live the CPU!”

Qt Hello World

The earlier blog-post was about how to use Qt Creator with OpenCL. The examples are all about images, but nowhere is there a simple Hello World. So here it is: AMD’s infamous OpenCL Hello World in Qt. Thanks to linuxjunk for glueing the parts together.



#include <cstdio>
#include "qclcontext.h"  // QtOpenCL header; assumed to be on the include path

int main(int argc, char *argv[]) {

    // Define the kernel. Take a good look at what it does:
    // every work-item copies one character of hw[] into the output buffer.
    QByteArray prog(
    "#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable\n"
    "__constant char hw[] = \"Hello World from Qt!\"; \n"
    "__kernel void hello(__global char * out) {\n"
    "  size_t tid = get_global_id(0); \n"
    "  out[tid] = hw[tid]; \n"
    "}\n"
    );

    // Get a context on the first available GPU.
    QCLContext context;
    if (!context.create(QCLDevice::GPU))
        qFatal("Could not create OpenCL context");

    // Allocate 100 bytes of memory on the Host.
    size_t mem_size = 100;
    char* outH = new char[mem_size];
    // Allocate buffer on the Device.
    QCLBuffer outCL = context.createBufferHost(outH, sizeof(char) * mem_size,
                                               QCLMemoryObject::WriteOnly);

    // Compile program against device.
    QCLProgram program = context.buildProgramFromSourceCode(prog);

    // Create a kernel object, tell it we are using the kernel called "hello".
    QCLKernel kernel = program.createKernel("hello");

    // Set the global work size. In this case it is the buffer-size.
    kernel.setGlobalWorkSize(outCL.size(), 1);

    // Turn on profiling.
    QCLCommandQueue queue = context.commandQueue();
    queue.setProfilingEnabled(true);

    // Queue the kernel up to run.
    // Give it an argument which is the memory we allocated above.
    QCLEvent event = kernel(outCL);

    // Use the event object above to block until processing has completed.
    event.waitForFinished();

    // Timing only works with profiling on. runTime is unsigned;
    // the cast keeps the printf format correct whatever its exact width.
    printf(" time '%llu'\n", static_cast<unsigned long long>(event.runTime()));

    // Read the results out of the shared memory area.
    outCL.read(outH, mem_size);

    // Write to screen.
    printf(" result = '%s'\n", outH);

    // Release the host buffer.
    delete[] outH;
    return 0;
}
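If you want to see what the kernel computes without a GPU or QtOpenCL at hand, the following is a minimal host-side sketch in plain C++. The function name emulate_hello is made up for this illustration and is not part of Qt or OpenCL; each loop iteration stands in for one work-item. Unlike the kernel above, the sketch adds a bounds check, because the buffer (100 bytes) is larger than the string.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of QtOpenCL): emulates the "hello" kernel
// on the CPU. One loop iteration plays the role of one work-item.
std::string emulate_hello(std::size_t global_work_size) {
    static const char hw[] = "Hello World from Qt!";  // includes trailing '\0'
    std::vector<char> out(global_work_size, '\0');

    for (std::size_t tid = 0; tid < global_work_size; ++tid) {
        // The kernel reads hw[tid] unconditionally. This sketch guards the
        // read so work-items beyond the end of the string write '\0'.
        out[tid] = (tid < sizeof(hw)) ? hw[tid] : '\0';
    }

    // Keep everything up to the first '\0', as printf("%s") would print it.
    return std::string(out.begin(), std::find(out.begin(), out.end(), '\0'));
}
```

Calling emulate_hello(100) mirrors the run above with mem_size set to 100; a smaller work size simply truncates the string.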

Have fun!

Speeding up your data-processing

Using unused processing power

The computer as we know it has changed a lot in the past few years. For instance, we can now use the graphics card for non-graphics purposes. This has resulted in a computer with a much higher potential. Doubling the processing speed, or more, is the rule rather than the exception. Using this otherwise unused speed gives a huge advantage to software that makes use of it, which explains the growing popularity.

The acceleration technique is called OpenCL. It works not only on graphics cards of AMD and NVidia, but also on the latest processors of Intel and AMD, and even on processors in smartphones and tablets. Special processors such as DSPs and FPGAs will get support too. As it is an open standard, the support will only grow.

Offered services

StreamHPC has been active as an acceleration specialist since June 2010 and offers the following services:

[list2]

  • development of extremely fast software,
  • design of (faster) algorithms,
  • acceleration of existing software, and
  • training in OpenCL and acceleration techniques.

[/list2]

Not many companies master these specialisms, and StreamHPC enjoys worldwide awareness on top of that. To provide support for large projects, collaborations with other companies have been established.

The preferred way of working is a low hourly rate combined with agreed bonuses for speed-ups.

Target markets

The markets we operate in are bio-informatics, finance, audio and video, high-tech R&D, energy, mobile apps and other industries that target more performance per Watt or more performance per second.

WBSO

What we offer suits WBSO projects well (in the Netherlands only). This means that a large part of the costs can be subsidised. Together we can promote new technologies in the Netherlands, which is the goal of this subsidy.

Contact

Call Vincent Hindriksen MSc at +31 6 45400456 or mail to vincent@StreamHPC.nl with all your questions, or request a free demo.

Download the brochure for more information.

Intel OpenCL CPU-drivers 2013 beta with OpenCL 1.2 support

Screenshot from Intel’s “God Rays” demo

This article is still work-in-progress

Intel has just released its OpenCL CPU-drivers, version 2013 beta. It also supports OpenCL 1.1 (not 1.2, as on the CPU) on the Intel HD Graphics 4000/2500 of the 3rd generation Core processors (Windows only). The release notes mention support for Windows 7 and 8, but the download site only mentions Windows 8. Support under Linux is limited to 64 bits.

The release notes mention:

  • General performance improvements for many OpenCL* kernels running on CPU.
  • Preview Tool: Kernel Builder (Windows)
  • Preview Feature: support of kernel source code hotspots analysis with the Intel VTune™ Amplifier XE 2011 update 3 or higher.
  • The GNU Project Debugger (GDB) debugging support on Linux operating systems.
  • New OpenCL 1.2 extensions supported by the CPU device:
    • cl_khr_int64_base_atomics and cl_khr_int64_extended_atomics
    • cl_khr_fp16
    • cl_khr_gl_sharing
    • cl_khr_gl_event
    • cl_khr_d3d10_sharing
    • cl_khr_dx9_media_sharing
    • cl_khr_d3d11_sharing.
  • OpenCL 1.1 extensions that were changed in OpenCL 1.2:
    • Device Fission supports both the OpenCL 1.1 EXT APIs and the OpenCL* 1.2 fission core features
    • Media Sharing supports the Intel 1.1 media sharing extension and also the 1.2 KHR media sharing extension
    • Printf extension is aligned with OpenCL 1.2 core feature.

Check the release notes for full information.

The drivers can be found on http://software.intel.com/en-us/articles/vcsource-tools-opencl-sdk-2013/. Installation is simple. For Windows there is an installer. If you have Linux, make sure you remove any previous version of Intel’s OpenCL drivers. If you have a Debian-based Linux, use the command ‘alien’ to convert the rpm to deb, and make sure ‘libnuma1’ is installed. There are requirements for libc 2.11 or 2.12; more information on that later, as Ubuntu 12.04 has libc6 2.15.

Continue reading “Intel OpenCL CPU-drivers 2013 beta with OpenCL 1.2 support”

Does GPGPU have a bright future?

This post is aimed at programmers. The main question is: “should I invest in learning CUDA/OpenCL?”

Using the video-processor for parallel processing has actually been possible since the beginning of 2006; you just had to know how to use the OpenGL Shader Language. Not long after that (end of 2006) CUDA was introduced. A lot has happened since, which resulted in the introduction of OpenCL in the fall of 2008. But the acceptance of OpenCL is actually pretty low. Many companies that do use it want to keep it as their own advantage and don’t tell the competition that they just saved hundreds of thousands of Euros/Dollars by replacing their compute-cluster with a single computer which cost them €10 000,- and a rewrite of the calculation-core of their software. Has it become a secret weapon?

This year a lot of effort will be put into integrating OpenCL into the existing programming languages (without all the thousands of tweak-options visible). Think of wizards around pre-built kernels and libraries. Next year everything will be about kernel-development (kernels are the programs which do the actual calculations on the graphics processor). The year after that, the peak is over and nobody knows it is built into their OS or programming-language. It’s just like how current programmers use security-protocols without knowing what they actually are.

If I want to slide to the next page on a modern mobile phone, I just make a call to a slide-function. A lot happens when the function is called, such as building up the next page in a separate part of memory, calling the GPU-functions to show the slide, and possibly unloading the previous page. The same goes for OpenCL: I want to calculate an FFT with a specified precision and I don’t want to care on which device the calculation is done. The advantage of building blocks (like LEGO) is that we keep the focus of development on the end-target, while we can tweak it later (if the customer has paid for this extra time). What’s a bright future if nobody knows it?
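The building-block idea can be sketched in a few lines of C++. Everything here is hypothetical and only for illustration: the names FourierBackend, CpuBackend, pick_backend and fourier_transform are made up, the only back-end is a naive CPU transform rather than a real FFT, and the precision parameter mentioned above is left out to keep it short. The point is that the caller sees one function and never learns which device did the work.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <memory>
#include <vector>

using Signal = std::vector<std::complex<double>>;

// Hypothetical back-end interface: each device (CPU, GPU, ...) would
// provide its own implementation.
class FourierBackend {
public:
    virtual ~FourierBackend() = default;
    virtual Signal transform(const Signal& input) const = 0;
};

// Reference back-end: a naive O(n^2) discrete Fourier transform on the CPU.
// An accelerated back-end would wrap OpenCL kernels behind the same interface.
class CpuBackend : public FourierBackend {
public:
    Signal transform(const Signal& input) const override {
        const std::size_t n = input.size();
        const double pi = std::acos(-1.0);
        Signal output(n);
        for (std::size_t k = 0; k < n; ++k) {
            std::complex<double> sum(0.0, 0.0);
            for (std::size_t t = 0; t < n; ++t) {
                const double angle = -2.0 * pi * static_cast<double>(k * t)
                                     / static_cast<double>(n);
                sum += input[t] * std::polar(1.0, angle);
            }
            output[k] = sum;
        }
        return output;
    }
};

// Placeholder device selection: this sketch always returns the CPU back-end.
// A real library would probe for available devices here.
std::unique_ptr<FourierBackend> pick_backend() {
    return std::unique_ptr<FourierBackend>(new CpuBackend());
}

// The building block the application calls; the device stays hidden.
Signal fourier_transform(const Signal& input) {
    return pick_backend()->transform(input);
}
```

Swapping in a faster back-end later changes only pick_backend, which is the “tweak it later” part of the argument.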

Has it become a secret weapon?

Yes and no. Companies want to brag about their achievements, but don’t want their competitors to go the same way and don’t want their customers to demand lower prices. AMD and NVidia are pushing OpenCL/CUDA, so it won’t stop growing in the market, although this push itself currently accounts for most of that growth. NVidia does a good job of marketing their CUDA-platform.

What’s a bright future if nobody knows it?

Everything that has market-wide acceptance has a bright future. It might be replaced by a successor, but acceptance is the key. With acceptance there will always be a demand for (specialised) kernels to be integrated in building blocks.

We also have the new processors with 32+ cores, which actually need to be used; you know the problem with dual-core “support”.

Also the mobile market is growing rapidly. Once that is opened for OpenCL, there will be a huge growth in demand for accelerated software.

My advice: if high performance is very important for your current or future tasks, invest in learning how to write kernels (CUDA or OpenCL, whichever is your favourite). Use wrapper-libraries which make it easy for you, because by the time you have learned how to use the raw OpenCL-calls, they will be completely integrated in your favourite programming language.