Altera published their OpenCL-on-FPGA optimization guide


Altera has just released their optimisation guide for OpenCL-on-FPGAs. It does not go into the how-tos of OpenCL, but assumes you already know the technology. Neither does it provide information on the basics of Altera’s Stratix V or other FPGAs.

It is their first public optimisation document, so feedback sent directly to Altera is appreciated. Not aware of what OpenCL can do on an FPGA? Watch the video below.

https://www.youtube.com/watch?v=p25CVFMc-dk

Subjects

The following subjects and optimisation tricks are discussed:

  • FPGA Overview
  • Pipelines
  • Good Design Practices
  • Avoid Pointer Aliasing
  • Avoid Expensive Functions
  • Avoid Work-Item ID-Dependent Backward Branching
  • Aligned Memory Allocation
  • Ensure 4-Byte Alignment for All Data Structures
  • Maintain Similar Structures for Vector Type Elements
  • Optimization of Data Processing Efficiency
  • Specify a Maximum Work-Group Size or a Required Work-Group Size
  • Loop Unrolling
  • Resource Sharing
  • Kernel Vectorization
  • Multiple Compute Units
  • Combination of Compute Unit Replication and Kernel SIMD Vectorization
  • Resource-Driven Optimization
  • Floating-Point Operations
  • Optimization of Memory Access Efficiency
  • General Guidelines on Optimizing Memory Accesses
  • Optimize Global Memory Accesses
  • Perform Kernel Computations Using Constant, Local or Private Memory
  • Single Work-Item Execution

Carefully compare these with CPU and GPU optimisation guides to be able to write more generic OpenCL code.

Download

You can download the document here.

If you have any questions on OpenCL-on-FPGAs, OpenCL, generic optimisations or Altera FPGAs, feel welcome to contact us.

Using Qt Creator for OpenCL

More and more ways are becoming available to bring easy OpenCL to you. Most of the convenience libraries are wrappers for other languages, so it seems that C and C++ programmers have the hardest time. For a while now my favourite way to go is Qt: it is multi-platform, has a good IDE, is very extensive, has good multi-core and OpenGL-support and… has an extension for OpenCL: http://blog.qt.digia.com/blog/2010/04/07/using-opencl-with-qt/

Other multi-platform choices are Anjuta, CodeLite, Netbeans and Eclipse. I will discuss them later, but wanted to give Qt an advantage because it also simplifies your OpenCL-development. While it is great for learning OpenCL-concepts, please know that the commercial version of Qt Creator costs at least €2995,- a year. I must also warn that the plugin is still in beta.

streamhpc.com is not affiliated with Qt.

Getting it all

Qt Creator is available in most Linux-repositories: install the packages ‘qtcreator’ and ‘qt4-qmake’. For Windows, Mac and other Linux-distributions there are installers available: http://qt.nokia.com/downloads. People who are not familiar with Qt really should take a look around on http://qt.nokia.com/.

You can get the source for the plugin QtOpenCL, by using GIT:

git clone http://git.gitorious.org/qt-labs/opencl.git QtOpenCL

See http://qt.gitorious.org/qt-labs/opencl for more information about the status of the project.

You can download it here: https://dl.dropbox.com/u/1118267/QtOpenCL_20110117.zip (version 17 January 2011)

Building the plugin

For Linux and Mac you need to have the ‘build-essentials’. For Windows it might be a lot harder, since you need make, gcc and a lot of other build-tools which are not easily packaged for Windows. If you’ve made a win32-binary and/or a Windows-specific how-to, let me know.

You might have seen that people have problems building the plugin. The trick is to use the options -qmake and -I (capital i) with the configure-script:

./configure -qmake <location of qmake 4.6 or higher> -I<location of directory CL with OpenCL-headers>

make

Notice the spaces. The program qmake is provided by Qt (package ‘qt4-qmake’); the OpenCL-headers come with the SDK of AMD or NVIDIA (you’ll need an SDK anyway), or from Khronos. For example, on my laptop (NVIDIA, Ubuntu 32-bit, with Qt 4.7):

./configure -qmake /usr/bin/qmake-qt4 -I/opt/NVIDIA_GPU_Computing_SDK_3.2/OpenCL/common/inc/

make

This should work. On Mac the directory is not CL but OpenCL – I haven’t tested whether Qt takes that into account.

After building, test it by setting the environment variable LD_LIBRARY_PATH to the lib-directory of the plugin, and running the provided example app ‘clinfo’. For example, on Linux:

export LD_LIBRARY_PATH=`pwd`/lib:$LD_LIBRARY_PATH

cd util/clinfo/

./clinfo

This should give you information about your OpenCL-setup. If you need further help, please go to the Qt forums.

Configuring Qt Creator

Now it’s time to make a new project with support for OpenCL. This has to be done in two steps.

First make a project and edit the .pro-file by adding the following:

LIBS        += -L<location of opencl-plugin>/lib -L<location of OpenCL-SDK libraries> -lOpenCL -lQtOpenCL

INCLUDEPATH += <location of opencl-plugin>/lib/ \
               <location of OpenCL-SDK include-files> \
               <location of opencl-plugin>/src/opencl/

For example:

LIBS        += -L/opt/qt-opencl/lib -L/usr/local/cuda/lib -lOpenCL -lQtOpenCL

INCLUDEPATH += /opt/qt-opencl/lib/ \
               /usr/local/cuda/include/ \
               /opt/qt-opencl/src/opencl/

The following screenshot shows how it can look:

Second we edit (or add) the LD_LIBRARY_PATH in the project-settings (click on ‘Projects’ as seen in screenshot):

/usr/lib/qtcreator:<location of opencl-plugin>:<location of OpenCL-SDK libraries>:

For example:

/usr/lib/qtcreator:/opt/qt-opencl/lib:/usr/local/cuda/lib:

As you see, we now also need to have the Qt-creator-libraries and SDK-libraries included.

The following screenshot shows the edit-field for the project-environment:

Testing your setup

Just add something from the clinfo-source to your project:

printf("OpenCL Platforms:\n");
QList<QCLPlatform> platforms = QCLPlatform::platforms();
foreach (QCLPlatform platform, platforms) {
   printf("    Platform ID       : %ld\n", long(platform.platformId()));
   printf("    Profile           : %s\n", platform.profile().toLatin1().constData());
   printf("    Version           : %s\n", platform.version().toLatin1().constData());
   printf("    Name              : %s\n", platform.name().toLatin1().constData());
   printf("    Vendor            : %s\n", platform.vendor().toLatin1().constData());
   printf("    Extension Suffix  : %s\n", platform.extensionSuffix().toLatin1().constData());
   printf("    Extensions        :\n");
   QStringList extns = platform.extensions();
   foreach (QString ext, extns)
      printf("        %s\n", ext.toLatin1().constData());
   printf("\n");
}

If it gives errors while programming (underlined includes, etc.), focus on INCLUDEPATH in the project-file. If it complains when building the application, focus on LIBS. If it complains when running the successfully built application, focus on LD_LIBRARY_PATH.

Ok, it is maybe not that easy to get it running, but I promise it gets easier after this. Check out our Hello World, the provided examples and http://doc.qt.nokia.com/opencl-snapshot/ to start building.

Software Development

Have you developed software that gives the answers you need, but takes too long? Or maybe you need to process large data-sets on an hourly basis, while a batch takes 2 hours?

What do you do when faster hardware starts to get too costly in terms of maintenance costs? You can buy specialized hardware, but that increases costs and dependence on external knowledge. Or, you can choose to just wait for the results to come in, but you can only do this when the computation is not a core process.

What if you could use off-the-shelf hardware to decrease waiting-time? By using OpenCL-devices, which can be high-end graphics cards or other modern processors, software can be sped up by a factor 2 to 20. Why? Because these devices can do much more in parallel and OpenCL makes it possible to make use of that (unused) potential. A few years ago this was not possible to do in the same way it is done now; that’s probably the main reason you haven’t heard of it.

Solutions

Everything we offer falls into three solutions: finding what is available, making a parallel version of the code, and hand-tuning the code for maximum performance.

[pricing_tables]
[pricing_table column=”one_third” title=”Specialised Libraries” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • For many cases “good enough”
  • Faster code, the easy way
  • High performance for generic problems

[/pricing_table]
[pricing_table column=”one_third” title=”Parallel Coding” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • Better caching can give more boost than using faster hardware
  • Software running in parallel is a first step to GPU-computing
  • Making the software modular when possible

[/pricing_table]
[pricing_table column=”one_third” title=”High Performance Coding” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  •  The highest performance is guaranteed
  • Optimized for the targeted hardware

[/pricing_table]
[/pricing_tables]

Services

There are many ways to speed up code, but only one is the best fit. To help you find the right path, we offer various services.

[pricing_tables]
[pricing_table column=”one_third” title=”Code Review” buttontext=”More info »” buttonurl=”https://streamhpc.com/consultancy/our-services/code-review/” buttoncolor=””]

  • Code-review of GPU-code (OpenCL, CUDA, Aparapi, and more).
  • Code-review of CPU-code (Java, C, C++ and more).
  • Report within 1 week if necessary.

[/pricing_table]
[pricing_table column=”one_third” title=”GPU Assessment” buttontext=”More info »” buttonurl=”https://streamhpc.com/consultancy/rapid-opencl-assessment/” buttoncolor=””]

  • Find parallelizable computations
  • Assess their fitness to run on GPUs
  • Report within 2 weeks

[/pricing_table]
[pricing_table column=”one_third” title=”Architecture Assessment” buttontext=”Request more info »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

  • Architecture check-up
  • Data-transport measurements
  • Report within 2 weeks

[/pricing_table]
[/pricing_tables]

More information

We can make your compute-intensive algorithms much faster and scalable. How do we do it? We can explain it all to you by phone or in person. Send in the form on this page, and we will contact you.

You can also call now: +31 6454 00 456.

We invite you to download our brochures to get an overview of how we can help you widen the bottlenecks in your software.

Updated: OpenCL and CUDA programming training – now online

Update: due to Corona, the Amsterdam training has been cancelled. We’ll offer the training online on dates that better suit the participants.

As it has been very busy here, we have not done public trainings for a long time. This year we’re going to train future GPU-developers again – online. For now it’s one date, but we’ll add more dates in this blog-post later on.

If you need to learn solid GPU programming, this is the training you should attend. The concepts can be applied to other GPU-languages too, which makes it a good investment for any probable future where GPUs exist.

This is a public training, which means there are attendees from various companies. If you prefer not to be in a public class, get in contact to learn more about our in-company trainings.

It includes:

  • Four days of training online
  • Free code-review after the training, to get feedback on what you created with the new knowledge;
  • 1 month of limited support, so you can avoid StackOverflow;
  • Certificate.

Trainings will be done by employees of Stream HPC, who all have a lot of experience with applying the techniques you are going to learn.

Schedule

Most trainings have around 40% lectures, 50% lab-sessions and 10% discussions.

Continue reading “Updated: OpenCL and CUDA programming training – now online”

Visit us (Amsterdam)

So we invited you over? Cool! See you soon!

The Amsterdam Stream HPC offices are located on the sixth floor of Koningin Wilhelminaplein 1 in Amsterdam, which is at the Amsterdam West Poort business area. Below you’ll find information on how to get there.

The entry of the office building

Getting to Koningin Wilhelminaplein 1

By Car

The office is located near the ring road A10, which makes the location easily accessible by car, via exit S107.

From the ring road A10 the complete Dutch motorway network is accessible. Taking the A10 to the South often results in a traffic jam though. See https://www.anwb.nl/verkeer for up-to-date traffic info.

Parking in the parking garage is only available if you let us know in advance! There is a ParkBee at 5 minutes walking distance – always more than enough space. It costs at most €10 per day when using the Yellowbrick app or when reserved via ParkBee, and about €20 per day when paid at the location. Please clarify in advance who pays this.

Route                 Travel time (outside rush hours)
Office – Schiphol     15 minutes
Office – The Hague    40 minutes
Office – Utrecht      35 minutes
Office – Rotterdam    50 minutes

By Public transport

The office is a 5 minute walk from Amsterdam Lelylaan. See further below for the walking route.

View in the direction of the office from the metro station

Amsterdam Lelylaan is a medium-sized public-transport hub. It is easy to get here from any big city or any address in Amsterdam, as many fast trains stop here too.

  • Trains to the North: Amsterdam Central, Haarlem, North and East of the Netherlands
  • Trains to the South: Schiphol, Amsterdam Zuid, Amsterdam RAI, Utrecht, Eindhoven, Leiden and Rotterdam
  • Bus: Lines 62 (Amstel), 63 (Osdorp), 195 (Schiphol).
  • Metro: Line 50, connecting to the Amsterdam train-stations Sloterdijk, Zuid, RAI and Bullewijk. If there are problems with the train to Lelylaan/Sloterdijk or Schiphol, one option is to go to Amsterdam Zuid and take the metro or train from there. Line 51 connects to the Vrije Universiteit in Amsterdam Zuid.
  • Tram: Lines 1 (Osdorp – Muiderpoort) and 17 (Osdorp – Central Station).

See https://9292.nl/station-amsterdam-lelylaan for all time tables and planning trips.

Walking from the train/metro station

Remember that in the Netherlands crossing car lanes is relatively safer than crossing bike lanes, contrary to traffic in most other countries. In Dutch cities cars brake when you cross the street, while bikes simply don’t. No joke. So make sure not to walk on the red bike paths unless really necessary.

When leaving the train station, make sure you take the Schipluidenlaan-exit towards the South (to the right, when you see the view as in the image). This is where the buses are, not the trams. If you are at the tram area (between two car roads), go back into the station.

Near the bus-stop, head to the roundabout to the West. Walk the whole street to the next roundabout, where you will see the shiny office building on your right.

By Taxi

In Amsterdam you can order a taxi via +31-20-6777777 (+31-20, then 6 times 7). Expect a minimum charge of €20.

At Schiphol Airport there are official taxi stands – it’ll take 15-25 minutes to get to Lelylaan outside rush hours. Make sure to mention the roundabout reconstruction, to prevent a 10-minute longer drive.

Bicycle

For biking, use https://www.route.nl/routeplanner with “Rembrandtpark” as the end-point for the better/nicer/faster routes. From the park it’s a very quick ride to the office – use a normal maps app for the final destination.

Inside

When entering the front door, go to the right to find the elevators, and take one to the 6th floor. From the elevators, go to the North. You’ll see our sign!

The entry of the office
Stream HPC has the blue marked office.

Valgrind suppression file for AMD64 on Linux

Valgrind is a great tool for finding possible memory leaks in code written in C, C++, Java, Perl, Python, assembly, Fortran, Ada, etc. I use it to check whether provided code is OK before I start porting it to GPU-code – it finds those devils in the details. It has also given me good feedback on my own bugs when writing OpenCL-code. Unfortunately it does not work well with optimised libraries, such as the OpenCL-driver from AMD.

You’ll get problems like the ones below, which clutter the output.

==21436== Conditional jump or move depends on uninitialised value(s)
==21436==    at 0x6993DF2: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x6C00F92: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x6BF76E5: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x6C048EA: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x6BED941: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x69550D3: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x69A6AA2: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x69A6AEE: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x69A9D07: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x68C5A53: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x68C8D41: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x68C8FB5: ??? (in /usr/lib/fglrx/libamdocl64.so)

How to fix this cluttering? Continue reading “Valgrind suppression file for AMD64 on Linux”
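As a preview of the direction the fix takes: Memcheck reads suppression rules from a file passed with `--suppressions=<file>`. An entry matching the frames above could look roughly like this sketch (the first line is an arbitrary name you choose; exact frame matching depends on your Valgrind version):

```
{
   amd_opencl_uninitialised_cond
   Memcheck:Cond
   obj:/usr/lib/fglrx/libamdocl64.so
}
```

Run with `valgrind --suppressions=amdocl.supp ./your-app` to hide these reports while keeping errors from your own code visible.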

SDKs

!!!THESE PAGES WILL BE MOVED TO OPENCL.ORG!!!

OpenCL is growing fast and various architectures now support compute-acceleration. This means that you have a lot of choice to find the right solution for your algorithm.


Working

Possibly in the (near) future

Currently we are looking into:

  • Game Consoles
    • Nintendo Wii U dev – only vague rumours.
    • Sony Playstation 4 Orbis – strong rumours.
  • Movidius – has internal builds, but will only release on customer’s request.
  • Texas Instruments – support on C66x multicore DSPs (PDF source) and on their ARM-chips.
  • ST-Ericsson
If you have more information, let us know.

Abandoned

 

Useful peripherals

When working with various devices, you might find the below tips useful.

ARM


When working with those small cute computers, three things come in handy:

  • an HDMI-switch (or a monitor with multiple HDMI-inputs);
  • a small keyboard+mouse combo which uses Bluetooth or only one USB-port. I use the Logitech keyboard shown at the right;
  • a network-switch with enough free ports. Even though most boards have Wi-Fi, wired internet proves valuable.

Qt Hello World

The earlier blog-post was about how to use Qt Creator with OpenCL. The examples there are all about images, but nowhere is there a simple Hello World. So here it is: AMD’s infamous OpenCL Hello World in Qt. Thanks to linuxjunk for gluing the parts together.



int main(int argc, char *argv[]) {

    // Define the kernel. Take a good look at what it does.
    QByteArray prog(
    "#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable\n"
    "__constant char hw[] = \"Hello World from Qt!\";\n"
    "__kernel void hello(__global char * out) {\n"
    "  size_t tid = get_global_id(0);\n"
    "  out[tid] = hw[tid];\n"
    "}\n"
    );

    // Get a context on the first available GPU.
    QCLContext context;
    if (!context.create(QCLDevice::GPU))
        qFatal("Could not create OpenCL context");

    // Allocate 100 bytes of memory on the host.
    size_t mem_size = 100;
    char* outH = new char[mem_size];
    // Allocate a buffer on the device.
    QCLBuffer outCL = context.createBufferHost(outH, sizeof(char) * mem_size,
                                               QCLMemoryObject::WriteOnly);

    // Compile the program against the device.
    QCLProgram program = context.buildProgramFromSourceCode(prog);

    // Create a kernel object; tell it we are using the kernel called "hello".
    QCLKernel kernel = program.createKernel("hello");

    // Set the global work size. In this case it is the buffer-size.
    kernel.setGlobalWorkSize(outCL.size(), 1);

    // Turn on profiling.
    QCLCommandQueue queue = context.commandQueue();
    queue.setProfilingEnabled(true);

    // Queue the kernel to run.
    // Give it an argument, which is the memory we allocated above.
    QCLEvent event = kernel(outCL);

    // Use the event object to block until processing has completed.
    event.waitForFinished();

    // Timing only works with profiling on. runTime is unsigned.
    printf(" time '%u'\n", event.runTime());

    // Read the results out of the shared memory area.
    outCL.read(outH, mem_size);

    // Write to screen.
    printf(" result = '%s'\n", outH);

    delete[] outH;
    return 0;
}

Have fun!

The CPU is dead. Long live the CPU!

Scene from Gladiator when is decided on the end of somebody’s life.

Look at the computers and laptops sold at your local computer shop. There are only a few systems with a separate GPU, either as a PCI-device or integrated on the motherboard. The graphics are handled by the CPU now. The Central Processing Unit as we knew it is dying.

To be clear, I will refer to an old CPU as “GPU-less CPU”, and to the new CPU (with GPU included) as plain “CPU” or “hybrid processor”. There are many names for the new CPU, each with their own history, which I will discuss in this article.

The focus is on X86. The follow-up article is on whether king X86 will be replaced by king ARM.

Know that all of this is based on my own observations; please comment if you have additional information.

Continue reading “The CPU is dead. Long live the CPU!”

Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup

Example of modelled versus measured water activity (‘effective’ concentration) for highly detailed organic chemical representation based on continental studies using UNIFAC

As we cannot share performance results for most of our commercial projects because they contain sensitive data, we were happy that Dr. David Topping from the University of Manchester was so kind to allow us to share the data for the UNIFAC project. The goal for this project was simple: port the UNIFAC algorithm to the Intel XeonPhi using OpenCL. We got a total speedup of 485x: 3.0x for going from single-core to multi-core CPU, 53.9x for algorithmic and low-level improvements plus a new memory-layout design, and 3.0x for using the XeonPhi via OpenCL. To remain fair, we used the 160x speedup over multi-core CPU in the title, not the speedup over serial code. Continue reading “Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup”

Our Team

StreamHPC is the best known company in GPU-computing (OpenCL/CUDA/HIP/SYCL). We are also active in related technologies like Cloud-computing, embedded development, algorithm-design, graphics development (OpenGL/VULKAN), Machine Learning, and HPC (OpenMP/MPI).

We are distributed between mainly Amsterdam, Budapest and Barcelona.

The developers, the heart of the company

The company consists of highly skilled developers and low-level performance engineers. We mostly manage ourselves, but always with the help of the group. This way everyone has influence by taking ownership.

Each employee regularly shares their experience and checks the work of colleagues, to keep the standards high. This results in faster deliveries with higher quality of code, for which we’ve been complimented often.

Want to work at StreamHPC too? Check our jobs-page.

The Leads

The senior team deals with new directions/markets/strategies, training the employees and making sure the project teams get enabled. We use EOS to lead our company.

Department               Lead                                    Extra info
Board                    Vincent Hindriksen + Maurizio Campese   Contact via contact@streamhpc.com
HR + Recruitment         Berrak Bas                              Contact via jobs@streamhpc.com
Consultancy + Projects   Maurizio Campese + Vincent Hindriksen   We will recruit for this role in H1 2025
Operations + Integrator  Maurizio Campese
Sales + Marketing        Vincent Hindriksen                      Contact via contact@streamhpc.com
Finance + Legal          Vincent Hindriksen
Open Standards
IT                       Robin Voetter + Balint Soproni          We are actively recruiting for this role

Hire the experts

On average our pipeline is full for the next 3-6 months, but we always reserve time for shorter projects (up to a month).

Call +31 854865760 or mail to info@streamhpc.com or fill in the contact form to have a chat on how we can solve your software performance problems or do your software development.

OpenCL SPIR by example

OpenCL SPIR (Standard Portable Intermediate Representation) is an intermediate representation for OpenCL-code, comparable to LLVM IL and HSAIL. It is a search for a good representation, such that parallel software runs well on all kinds of accelerators. LLVM IL is too general, but SPIR is a subset of it. I’ll discuss HSAIL and where it differs from SPIR later – I thought SPIR was the better way to start introducing these. In my next article I’d like to give you an overview of the whole ecosystem around OpenCL (including SPIR and HSAIL), so you understand what it all means, where we’re going, and why.

Know that the new SPIR-V is something completely different in implementation, and we are only discussing the old SPIR here.

Contributors to the SPIR specifications are: Intel, AMD, Altera, ARM, Apple, Broadcom, Codeplay, Nvidia, Qualcomm and Xilinx. Boaz Ouriel of Intel is the pen-holder of the specifications, and unsurprisingly Intel had the first SPIR-compiler. I am happy to see Nvidia is in the committee too, and hope they don’t just take ideas for CUDA from this collaboration but finally join. Broadcom and Xilinx are new, so we can expect things from them.

 

For now, let’s just see what SPIR is – it can help us understand how the compiler works and write better OpenCL code. I used Intel’s offline OpenCL compiler; compiling the below kernel to SPIR can be done on the command line with: ioc64 -cmd=build -input=sum.cl -llvm-spir32=sum.ll (you need an Intel CPU to use the compiler).

[raw]

__kernel void sum(const int size, __global float * vec1, __global float * vec2){
  int ii = get_global_id(0);

  if(ii < size) vec2[ii] += vec1[ii];

}

[/raw]

There are two variants for generating SPIR-code: binary SPIR and LLVM-SPIR (both in 32-bit and 64-bit versions). As you might expect, the binary form is not really readable, but SPIR described in the LLVM IL language luckily is. Run ioc64 without parameters to see more options (assembly, pure LLVM, intermediate binary).

WebCL Widget for WordPress

See the widget at the right showing whether your browser+computer supports WebCL?

It is available under the GPL 2.0 license and based on code from WebCL@NokiaResearch (thanks guys for your great Firefox-plugin!)

Download from WordPress.org and unzip in /wp-content/plugins/. Or (better), search for a new plugin: “WebCL”. Feedback can be given in the comments.

I’d like to get your feedback on what features you would like to see in the next version.

Continue reading “WebCL Widget for WordPress”

Kernels and the GPL. Are we safe and linking?

Disclaimer: I am not a lawyer and below is my humble opinion only. The post is for insights only, not for legal matters.

GPL has always been a protection against somebody or some company running away with your code and making money with it, or at least a way to force improvements back into the community. For unprepared companies this caused quite some stress when they were forced to give their software away. Now we have host-kernel-languages such as OpenCL, CUDA, DirectCompute and RenderScript, which don’t really link a kernel, but load it and launch it. As GPL is quite complicated when mixed with commercial code, I want to warn that GPL might not be prepared for this.

If your software is dual-licensed, you cannot assume the GPL is not the licence chosen when it is eventually used in commercial software. Read below why not.

I hope we can have a discussion here, so we get to the bottom of this.

Continue reading “Kernels and the GPL. Are we safe and linking?”

28 June: OpenCL course in Utrecht, NL

On 28 June 2011 StreamComputing will give a 1-day course on OpenCL in Utrecht. As it is quite new, the price is reduced. If you want to learn CUDA or any other GPGPU-language, this course is also a good option for you: the most important thing about GPGPU is the concepts – in other words, the “why” behind GPGPU-languages being designed like this. In my course you will get it after a one-day training.

Most of the day consists of lectures with short lab-sessions. The training makes use of a unique block-method, so you learn the technique top-down and can almost fill in the gaps yourself. At least 2 years of thorough programming experience in Java, C++ or Objective-C is preferred, because of the level of the subjects. The following is discussed, with the big why-question leading:

[list1]

  • OpenCL debunked: getting to understand how OpenCL is engineered.
  • Algorithms: which can be sped up with GPGPU/OpenCL and which cannot.
  • Architectures & optimisations: why one OpenCL-program works well on one architecture but not on another.
  • Software-engineering: wrapper-languages, code re-use and integration in existing software.
  • Debugging: not the screenshots, but insight into how the memory-models work.
[/list1]

The lab-sessions are very minimal; you get (fully documented) homework which you can do in the subsequent week (with assistance via mail). If you prefer extensive lab-sessions, please inquire about the possibilities.

After the session and the homework you’ll be able to decide on your own what kind of software can be sped up using OpenCL and what cannot. You will also be able to integrate OpenCL into your own software and engineer OpenCL-kernels. Note that the progress you make depends heavily on your seniority in programming. If all attendees are Dutch, the course is given in Dutch. Future sessions will be in other cities, so if you prefer to receive training more locally or at your company, please ask for the possibilities.

If you want more information, contact us.

Interest in OpenCL

I have had this blog for more than a year now, and I want to show where its visitors come from. Why? So you know where OpenCL is popular and where it is not. I chose an undisclosed period, so you cannot reverse-engineer how many visitors I have – the nice thing is that not much changes between a few days and a month. Unluckily Google Analytics is not really great for maps (Greenland as big as Africa, US states hard to compare to EU countries, cities disappearing at world-views, etc.), so I needed to do some quick image-editing to make it somewhat clearer.

At the world-view you see that most interest comes from 3 sub-continents: Europe, North America and South-East Asia. Africa is the real absentee here: except for some Arab countries and South Africa, there are only sporadic visits from the other countries. What surprises me is that the Arab countries are among my frequent visitors – this could be a language issue, but I expected about the same number of visitors as from e.g. China. In Latin America the interest comes mostly from Brazil.

Continue reading “Interest in OpenCL”

ERSA-NVIDIA award for “Best Young Entrepreneur”

StreamHPC supports the ERSA conference, 22-25 July in Las Vegas. At that conference an award will be given to the “Best Young Entrepreneur”, and I’d like you to send in a proposal. The winner gets an NVIDIA Tesla K20!

Young entrepreneurs and academics with a great product/project are invited to present their solution. As the event draws around 2000 people, you get the attention needed to showcase your new company or research group. Your solution does not need to be based on FPGAs or GPUs, as long as Von Neumann’s architecture is not in it.

Read the information below or directly go to the ERSA-NVIDIA awards-homepage.
“Von Neumann’s architecture lasted for 75 years. That genius can no longer lead us into the new age of computing that is upon us. This competition seeks to acknowledge those pioneers that are helping to build the new computing landscape.”

Submission of Proposals for ERSA-NVIDIA award Candidates
Deadline: 31 May 2013 (extended from 6 May 2013)
Send proposals to org@ersaconf.org

The award is devoted to entrepreneurs developing tools, advanced technologies and opportunities for supporting applications, both academic and commercial, across the broad area of high-performance embedded systems implemented as multicore systems and reconfigurable heterogeneous parallel processing systems.

The Award Committee includes:

Leading Universities

  • Stanford University, USA, Prof. Michael Flynn
  • Imperial College London, UK, Prof. Wayne Luk
  • Karlsruhe Institute of Technology, Germany, Prof. Joerg Henkel
  • Keio University, Japan, Prof. Hideharu Amano
  • Shanghai Jiao Tong University, China, Prof. Simon See

Leading Companies (tentative list):

  • NVIDIA, Can Ozdoruk, Product Manager
  • Altera, Steve Casselman, Principal Engineer
  • National Instruments, Hugo Andrade, Principal Architect

For more info go to: http://ersaconf.org/awards/

If you have any question, just ask them in the comments or send us an email.

The Fastest Payroll System Of The World

At StreamHPC we do several very different types of projects, but this project has been very, very different. For starters, it was nowhere close to scientific simulation or media processing. Our client, Intersoft Solutions, asked us to speed up thousands of payroll calculations on a GPU.

They wanted to solve a simple problem, avoiding slow conversations with HR of large companies:

Yes, I can answer your questions.

For that I need to do a test-run.

Please come back tomorrow.

The calculation of 1600 payslips took one hour, which means 10,000 employees would take over 6 hours. Potential customers appreciated the clear advantages of Intersoft’s solution, but said that they were looking for a faster solution in the first place.

Using our accelerated compute engine, a run with 3300 employees (anonymised, real data) now takes only 20 seconds, including loading and writing all data to the database – a speedup of about 250 times. A run with 100k employees gets all calculations done in under 2 minutes – the above HR department would have liked that.

Continue reading “The Fastest Payroll System Of The World”

Shopper Bag

The page below is for explaining to non-engineers, who saw the link on the shopper-bag.

Back in university, engineers learn about the traveling salesman problem — finding the shortest route that visits every city on a list and returns to the start. It sounds simple, but solving it efficiently is incredibly complex. Even computers sweat over it.

Now imagine being in a supermarket with a mental shopping list and a rough map of the store layout. An engineer’s brain kicks in. Without even thinking, we’re solving a real-life shortest path problem. Every aisle, every detour, every item — calculated. Did you interrupt this person? You’ve just introduced chaos into an optimization process.

That’s what the bag is for. It’s not just about carrying groceries—it’s about keeping the system stable. 😄

The shopper problem.
A modern take on the traveling salesman problem:

Want to try it yourself?

  • Read the basics of the traveling salesman problem on Wikipedia.
  • Or try solving it online with this tool by the Technical University of Munich. Start with 3–5 cities for a challenge that won’t melt your brain.
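For the curious, here is a minimal brute-force sketch in Python of the shopper version of the problem. The “cities”, distances and names are made up for illustration; brute force only works for small lists, which is exactly why the real problem is so hard.

```python
from itertools import permutations

# Hypothetical supermarket stops with made-up walking distances (symmetric).
dist = {
    ("milk", "bread"): 4, ("milk", "eggs"): 2, ("milk", "apples"): 7,
    ("bread", "eggs"): 3, ("bread", "apples"): 6, ("eggs", "apples"): 5,
}

def d(a, b):
    # Look up the distance in either direction.
    return dist.get((a, b)) or dist[(b, a)]

def shortest_tour(start, stops):
    """Brute force: try every possible order of the remaining stops."""
    best = None
    for order in permutations(stops):
        route = (start,) + order + (start,)  # return to where you started
        length = sum(d(route[i], route[i + 1]) for i in range(len(route) - 1))
        if best is None or length < best[0]:
            best = (length, route)
    return best

length, route = shortest_tour("milk", ["bread", "eggs", "apples"])
print(length, route)
```

With n stops there are n! orders to check, so adding just a few items to the mental shopping list makes the search explode – hence the sweating computers.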

Want the bag?

Let us know, and we’ll make sure you’re equipped for your next optimized grocery mission.

Speeding up your data-processing

Using unused processing power

The computer as we know it has changed a lot over the past years. For instance, we can now use the graphics card for non-graphics purposes. This has given the computer a much higher potential: a doubling of processing speed, or more, is the rule rather than the exception. Using this unused extra speed gives a huge advantage to software that makes use of it – and that explains the growing popularity.

The acceleration technique is called OpenCL and works not only on graphics cards from AMD and NVIDIA, but also on the latest processors from Intel and AMD, and even on processors in smartphones and tablets. Special processors such as DSPs and FPGAs will get support too. As it is an open standard, support will only grow.

Offered services

StreamHPC has been active as an acceleration specialist since June 2010 and offers the following services:

  • development of extremely fast software,
  • design of (faster) algorithms,
  • acceleration of existing software, and
  • training in OpenCL and acceleration techniques.


Not many companies master these specialisms, and StreamHPC enjoys worldwide recognition on top of that. To provide support for large projects, collaborations with other companies have been established.

The preferred way of working is a low hourly rate plus agreed bonuses for speed-ups.

Target markets

The markets we operate in are bio-informatics, finance, audio and video, high-tech R&D, energy, mobile apps and other industries that target more performance per Watt or more performance per second.

WBSO

What we offer suits WBSO projects well (in the Netherlands only). This means that a large part of the costs can be subsidised. Together we can promote new technologies in the Netherlands, as is the goal of this subsidy.

Contact

Call Vincent Hindriksen MSc at +31 6 45400456 or mail to vincent@StreamHPC.nl with all your questions, or request a free demo.

Download the brochure for more information.

Does GPGPU have a bright future?

This post is aimed at programmers. The main question: “should I invest in learning CUDA/OpenCL?”

Using the video processor for parallel processing has actually been possible since early 2006; you just had to know how to use the OpenGL Shading Language. Not long after that (end of 2006) CUDA was introduced. A lot has happened since, which resulted in the introduction of OpenCL in fall 2008. But the actual acceptance of OpenCL is pretty low. Many companies that do use it want to keep it as their own advantage and don’t tell the competition they just saved hundreds of thousands of Euros/Dollars because they could replace their compute cluster with a single computer which cost them €10 000,- and a rewrite of the calculation core of their software. Has it become a secret weapon?

This year a lot of effort will be put into integrating OpenCL within existing programming languages (without all the thousands of tweak options visible). Think of wizards around pre-built kernels and libraries. Next year everything will be about kernel development (kernels are the programs which do the actual calculations on the graphics processor). The year after that, the peak is over and nobody knows it is built into their OS or programming language. It’s just like current programmers using security protocols without knowing what they actually are.

If I want to slide to the next page on a modern mobile phone, I just call a slide function. A lot happens when that function is called, such as building up the next page in a separate part of memory, calling the GPU functions to show the slide, and possibly unloading the previous page. The same goes for OpenCL: I want to calculate an FFT with a specified precision and I don’t want to care on which device the calculation is done. The advantage of building blocks (like LEGO) is that they keep the focus of development on the end target, while we can tweak it later (if the customer has paid for this extra time). What’s a bright future if nobody knows it?
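The building-block idea can be sketched in a few lines of Python. The dispatch logic below is hypothetical (made up for illustration); NumPy’s CPU implementation stands in for whatever kernel a real library would pick. The point is only that the caller states *what* to compute, not *where*:

```python
import numpy as np

def fft(signal, device="auto"):
    """Hypothetical building-block: the caller asks for an FFT and never
    mentions OpenCL, kernels or devices. A real library would probe for a
    GPU here and dispatch a kernel to it; this sketch always falls back to
    NumPy's CPU implementation."""
    if device == "auto":
        device = "cpu"  # imagine: prefer a GPU when one is available
    return np.fft.fft(np.asarray(signal, dtype=np.complex128))

# Application code stays device-agnostic:
spectrum = fft([1.0, 0.0, -1.0, 0.0])
```

If the library later learns to run the same call on a new device, the application above does not change – which is exactly the LEGO-like tweak-it-later property described above.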

Has it become a secret weapon?

Yes and no. Companies want to brag about their achievements, but don’t want their competitors to go the same way and don’t want their customers to demand lower prices. AMD and NVIDIA are pushing OpenCL/CUDA, so it won’t stop growing in the market – but this pushing is actually the biggest driver of growth. NVIDIA does a good job of marketing its CUDA platform.

What’s a bright future if nobody knows it?

Everything that has market-wide acceptance has a bright future. It might be replaced by a successor, but acceptance is the key. With acceptance there will always be demand for (specialised) kernels to be integrated into building blocks.

We also have the new processors with 32+ cores, which actually need to be used; you know the problem with mere dual-core “support”.

The mobile market is growing rapidly too. Once that is opened up for OpenCL, there will be a huge growth in demand for accelerated software.

My advice: if high performance is very important for your current or future tasks, invest in learning how to write kernels (CUDA or OpenCL, whichever is your favourite). Use wrapper libraries which make it easy for you, because once you’ve learned how to use the OpenCL calls, they are completely integrated into your favourite programming language.