Problem solving tactic: making black boxes smaller

We are a problem-solving company first, specialised in HPC – building software close to the processor. The more projects we finish, the clearer it becomes that without our problem-solving skills we could not tackle the complexity of GPU and CPU clusters. While I normally shield off how we work and how we continuously improve ourselves, it is good to share a bit more, so both new customers and new recruits know what to expect from the team.

https://twitter.com/StreamHPC/status/1235895955569938432

Black boxes will never be transparent

Assumption is the mother of all mistakes

Eugene Lewis Fordsworthe

A colleague put “Assumptions is the mother of all fuckups” on the wall, because we should assume that we assume. The problem is that we want full control and fast decisions, and assuming conveniently fills in all those scary unknowns.

Continue reading “Problem solving tactic: making black boxes smaller”

What does it mean to work at Stream HPC?

High-performance computing on many-core environments and low-level optimizations are very important concepts in large scientific projects nowadays. Stream HPC is one of the market’s more prominent companies, active mostly in North America and Europe.

As we often get asked what it is like to work at the company, we’d like to give you a little peek into our kitchen.

What we find important

We’re a close-knit group of motivated individuals who get a kick out of performance optimizations and are experienced in programming GPUs. Every day we have discussions on performance: finding out why certain hardware behaves in a certain manner when a specific computing load is applied, why certain code is not as fast as theoretically promised, and then locating the bottlenecks by analyzing the device and finding solutions to remove them. As a team we make better code than we ever could as individuals.

Quality is important for everybody on the team, which is a whole step further than “just getting the job done”. This has a simple reason: we cannot speed up code that is of low quality. This is also why we don’t use many tools that automatically do magic, as these often miss many significant improvements and don’t improve the code quality. We don’t expect AI to fully replace us soon, but once it’s possible we’ll probably be part of that project ourselves.

Computer science in general is evolving at a fast rate, and therefore learning is an important part of the job. Reading papers, finding new articles, and discussing future hardware architectures and how they would affect performance are all very important. With every project we have to gather as much data as possible from scientific publications, interesting blog posts and code repositories, in order to be on the bleeding edge of technology for our project. Why use a hammer to speed up code when you don’t know which hammer to use best?

Our team-culture

Personality of the team

We are all kind, focused on structured problem-solving, communicative about wins and struggles, putting group wins above personal gains – and we are all gamers. To have good discussions and good disagreements, we seek people who are also open-minded. And we share and appreciate humor! If you want to know more about our culture, click here.

Tailored work environment

We have all kinds of people on the team, who need different ways of recharging: one needs a walk, while somebody else needs a quiet place. We help each other with more than just work-related obstacles. We think that this broad approach to differences teaches us how to progress to the next professional level the quickest. This is inclusivity in action, and we’re proud of it. Oh, and we have noise-canceling headphones.

Creating a safe place to speak up is critical for us. This helps us learn new skills and do things we have never done before. And this approach also works well for all those who don’t have Asperger’s or ADHD at all, but need to progress without first fitting a certain norm.

Projects we do

Today we work on plenty of exciting projects and no year has been the same. Below is a page with projects we’re proud of.

https://streamhpc.com/about-us/work-we-do

Style of project handling

We use GitLab and Mattermost to share code and have discussions. This makes it possible to keep good track of each project – searching for what somebody said or coded two years ago is quite easy. Using modern tools has changed the way we work a lot, and thus we have questioned and optimized everything that was presented as “good practice”. Most notable are our management and documentation styles.

Saying that engineers hate documentation and being managed because they are lazy is simply false. It’s because most management and documentation styles are far from optimal.

Pull-style management means the tasks are written down by the team, based on the proposal. All these tasks are put into the task list of the project, and then each team member picks the tasks that are a good fit. The last resort for tasks that stay behind and have a deadline (being pushed) has only been needed in a few cases.

All code (every merge request) is checked by one or two colleagues, chosen by the one who wrote the code. Even more important are the discussions in advance, as the group can give more insight than any individual, and one gets into the task well-prepared. The goal is not just to get the job finished, but to avoid having written the code in which a future bug will be found.

All types of code can contain comments, and Doxygen can create documentation from them automatically, so there is no need to copy functions into a Word document. We introduced log-style documentation because git history and Doxygen don’t answer why a certain decision was made. With such a logbook, a new member of the team can just read these remarks and fully understand why the architecture is how it is and what the limits are. We’ll discuss this in more detail later.

These types of solutions describe how we work and how we differ from a corporate environment: no-nonsense and effective.

Where do we fit in your career?

Each job should move you forward, when taken at the right moment. The question is when Stream HPC is the right choice.

As you might have seen, we don’t require a certain education. This is because a career is a sum, and an academic study can be replaced by various types of experience. The optimum is often both a study and the right type of experience. This means that for us a senior can be a student, and a junior can have been in the field for 20 years.

So what is the “right type of experience”? Let’s talk about those who only have job experience with CPUs. First, being hooked on performance as a primary interest – the main reason to get into HPC and GPGPU. Second, being good at C and C++ programming. Third, knowing algorithms and mathematics really well and being able to apply them quickly. Fourth, being a curious and quick learner, shown by having experimented with GPUs. This is also exactly what we test and check during the application procedure.

On the job you’ll learn everything around GPU programming, with a balance between theory and practice. Preparation is key in how we work, and this is a skill you will develop in many circumstances.

Those who left Stream HPC have gone on to very senior roles, from team lead to CTO. With Stream HPC growing in size, the growth opportunities within the company are also increasing.

Make the decision for a new job

Would you like to work for a rapidly growing company of motivated GPU professionals in Europe? We seek motivated, curious, friendly people. If you liked what you read here, do check our open job positions.

OpenCL – the battle, part I

Part I: the Hardware-companies and Operating Systems

(Part II will be about programming languages and software-companies, part III about the gaming-industry)

OpenCL is the new, but already de facto, standard for stream computing; how it got there so fast is somewhat strange. A few years ago there were many companies and research groups seeing the power of using the GPU, such as:

And the fight is really not over, since we are talking about a big shift in the supercomputing industry. Just think of IBM BlueGene, which will lose a lot of market share to NVIDIA and AMD. Or Intel, which hasn’t acquired a GPU maker as AMD did. Who had expected the market to change this rigorously? If we’re honest, we could have seen it coming (looking at the turbulence around PhysX and Havok), but “normally” such new techniques would be introduced slowly.

The fight is about market share. For operating systems, users want their movies encoded in 20 minutes, just like their neighbour. For HPC, clusters can now be updated for a far lower price than was possible the old-fashioned way; here the battle is mostly between Linux HPC and Windows HPC (which still has a very small market share), but database engines that rely on high-performance hardware and software also play a part.
The most to gain is in the processor market. The extremely large consumer market has been declining since 2004, since most users do not need more than a netbook and have bought a separate gaming computer for the more demanding games. We no longer see only Intel and AMD, but also IBM’s powerful Cell and Power processors, very power-efficient ARM processors, and more. Now that OpenCL makes it more interesting to buy an average processor and a good graphics card, Intel (and AMD) have no choice but to take up the battle with NVIDIA.

Background: Why Apple made OpenCL

Short answer: pure frustration. All those different implementations would either grab a market share or fight to become the standard; Apple wanted to bet on the right horse and therefore took the lead in creating an open standard. Money would be made by updating software and selling more hardware. For that reason Apple’s close partners Intel and NVIDIA were easily motivated to help develop the standard. Currently Apple’s only (public) reasons for giving away such an expensive and specialised project are publicity and staying ahead of the competition. Since it will not be a core business of Apple, Apple does not need to stay in the lead – but which companies do?

Acquisitions, acquisitions, acquisitions

No time to lose for the big companies, so they must get the knowledge in-house as soon as possible. Below are some examples.

  • Microsoft: Interactive Supercomputing (22-Sept-2009): made Star-P, software which allowed users to perform scientific, engineering or analytical computations on array- or matrix-based data using parallel architectures such as multi-core workstations, multi-processor systems, distributed-memory clusters or utility/cloud-based environments. This is completely in the field of OpenCL, which Microsoft needs to strengthen its products, such as SQL Server and Windows HPC, as Apple already did.
  • NVIDIA: Ageia Technologies (22-Feb-2008): made specialised PC cards and software for calculating complicated physics in games. They made the first commercial product aimed at the masses (gamers). PhysX code could be integrated into NVIDIA drivers to be used with modern NVIDIA GPUs.
  • AMD: ATI (24-Jul-2006): graphics-chip specialist. Although the price was too high, it saved AMD from being bought out by Intel and could even have kept it ahead (had it kept up the pace).
  • Intel: Havok (17-Sept-2007): builds game tools, such as a physics engine. After Ageia was captured, it was the only good company left to buy; AMD was too late, having spent all its money on ATI. Wind River (4-June-2009): a company providing embedded systems, development tools for embedded systems, middleware and other types of software. Also read this interesting article. Cilk (31-July-2009): offers parallel extensions that are tightly tied into a compiler. RapidMind (19-Aug-2009): created the high-level language Sh, which had an OpenCL backend. Intel has a lead in CPU compilers, which it wants to broaden to multi-core and GPU compilers. Intel discovered it was in the group of “old-fashioned compiler builders” and had a lot to learn in a short time.

If you know more acquisitions of interest, please let us know.

Winners

Apple, Intel and NVIDIA are the winners for 2009 and 2010. They currently have the most knowledge in-house and have their marketing machines running. NVIDIA has the best insight into new markets.

Microsoft and the game developers come second; they took the first train by joining the OpenCL consortium and taking it very seriously. At the end of 2010 Microsoft will be at Apple’s level of expertise, and then we will see who has the best novelties. The game developers, most of whom already have experience with physics calculations, all get a second chance after having misjudged the physics engines. More on gaming in part III.

AMD is currently actually a big loser, since it does not seem to take it all seriously enough. But AMD can afford to be late, since OpenCL makes switching easy. We hope the best for AMD, since it has the technology of both CPUs and GPUs, and many years of experience in both fields. More on the competition between marketing monster NVIDIA and silent AMD will follow in a blog item next week.

Another possible loser is Linux, which has a lot to lose in the HPC market; BSD-based Apple and Windows HPC can actually win market share now. Expect most from hardware manufacturers Intel, AMD and NVIDIA giving code to the community, but also from universities, which do lots of research on the ever-flexible Linux. In the end it all depends on OpenCL adoption by (Linux-specific) programming languages, which will be discussed in part II.

ARM is a member of the OpenCL group but does not seem to invest in it; it seems to target another growing market: low-power mobile devices. We will write about OpenCL and the mobile market later, and about why ARM can currently be relaxed about OpenCL.

We hope you now have more insight into this new market; please contact us for more specific information and feel free to leave your comments. Stay tuned for parts II and III, which will be released in the next few weeks.

PDFs of Monday 29 August

This is the first PDF Monday. It started because I use Mondays to read up on what is happening around OpenCL, and I like to share that with you. It is a selection of what I find (somewhat) interesting – don’t hesitate to contact me on anything you want to know about accelerated software.

Parallel Programming Models for Real-Time Graphics. A presentation by Aaron Lefohn of Intel. Why a mix of data-, task-, and pipeline-parallel programming works better using hybrid computing (specifically Intel processors with the latest AVX and SSE extensions) than using GPGPU.

The Practical Reality of Heterogeneous Super Computing. A presentation by Rob Farber of NVIDIA on why discrete GPUs have a great future even if heterogeneous processors hit the market. Nice insights, as you can expect from the author of the latest CUDA book.

Scalable Simulation of 3D Wave Propagation in Semi-Infinite Domains Using the Finite Difference Method (Thales Luis Rodrigues Sabino, Marcelo Zamith, Diego Brandâo, Anselmo Montenegro, Esteban Clua, Maurício Kischinhevksy, Regina C.P. Leal-Toledo, Otton T. Silveira Filho, André Bulcâo). GPU based cluster environment for the development of scalable solvers for a 3D wave propagation problem with finite difference methods. Focuses on scattering sound-waves for finding oil-fields.

Parallel Programming Concepts – GPU Computing (Frank Feinbube) A nice introduction to CUDA and OpenCL. They missed task-parallel programming on hybrid systems with OpenCL though.

Proposal for High Data Rate Processing and Analysis Initiative (HDRI). Interesting if you want to see a physics project where they have not yet decided whether to use GPGPU or a CPU cluster.

Physis: An Implicitly Parallel Programming Model for Stencil Computations on Large-Scale GPU-Accelerated Supercomputers (Naoya Maruyama, Tatsuo Nomura, Kento Sato and Satoshi Matsuoka). A collection of macros for GPGPU, tested on TSUBAME2.

Creative industry

Desktops

When working with CAD software, we tend to need a lot of rendering (ray-traced animations) and make extensive use of photo editors. This means that there will be a huge speed-up when using OpenCL. In case of a little mistake in the final rendering, there no longer needs to be a choice between ignoring the detail and going for quality: you can just re-render and still make the deadline. In most cases the extra processing power is needed by all employees, and thus the best option is to upgrade the software. You can consult your software supplier for more information.

But what is most interesting is getting the hardware upgraded. On Apple computers, unluckily, the support for stream processors is lacking. We’ll just have to wait until NVIDIA and AMD listen to the growing group of OpenCL-demanding users on Apple computers. Contact us if you want to be the first to know when this becomes possible!

On PCs we can upgrade the computers with up to 4 stream processors, providing up to 5 teraflops of computing power. This results in real-time rendering of normal-resolution images and a 50-times speed-up on high-resolution images. We don’t think efficiency will increase through fewer lost hours (because creativity happens inside the head, while drinking coffee), but it will certainly increase the end quality, because more possibilities can be tried out and the creator is exposed to more visual feedback.

Render farms

When rendering movies and other high-quality, high-resolution visual material, a single desktop might not be sufficient. Our default solution is a render farm (a cluster of at least 5 servers and 1 control computer) running DrQueue. We’re all familiar with the stories of Pixar movies that took 3 years to finish rendering. With OpenCL this can be brought back to 3 or 4 months, even with higher demands. Most movies with less complex materials (such as hair) can actually be rendered faster than real-time.

Supporting OpenCL on your own hardware

Say you have a device which is extremely good at numerical trigonometry (including integrals, transformations, etc., mainly to support Fourier transforms) by using massive parallelism. You also have an optimised library which takes care of the transfer to the device and the handling of the trigonometric math.

Then you find out that the strength of your company is not the device alone, but also the powerful and easy-to-use library. You also find out that companies are willing to pay for the library, if it would work with other devices too. From your own helpdesk you hear that most questions are about extending the library with specialised functions. Given this information, you define new customer groups for device-only and library-only sales – so just by adopting a standard you can increase revenue. Read below which steps you have to take to adopt OpenCL.

Continue reading “Supporting OpenCL on your own hardware”

PDFs of Monday 16 April

By exception, another PDF-Monday.

OpenCL vs. OpenMP: A Programmability Debate. One moment OpenCL and the next moment OpenMP produces the faster code. From the conclusion: “OpenMP is more productive, while OpenCL is portable for a larger class of devices. Performance-wise, we have found a large variety of ratios between the two solutions, depending on the application, dataset sizes, compilers, and architectures.”

Improving Performance of OpenCL on CPUs. Focusing on how to optimise OpenCL. From the abstract: “First, we present a static analysis and an accompanying optimization to exclude code regions from control-flow to data-flow conversion, which is the commonly used technique to leverage vector instruction sets. Second, we present a novel technique to implement barrier synchronization.”

Variants of Mersenne Twister Suitable for Graphic Processors. Source-code at http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MTGP/

Accelerating the FDTD method using SSE and GPUs. “The Finite-Difference Time-Domain (FDTD) method is a computational technique for modelling the behaviour of electromagnetic waves in 3D space”. This is a project plan, but it describes the theory pretty well. Continue reading “PDFs of Monday 16 April”

Porting CUDA to OpenCL

OpenCL speed-meter in 1970 Plymouth Cuda car

Why port your CUDA-accelerated software to OpenCL? Simply, to make your software also run on AMD CPU/APU/GPU, Intel CPU/GPU, Altera FPGA, Xilinx FPGA, Imagination PowerVR, ARM MALI, Qualcomm Snapdragon and upcoming architectures.

And as OpenCL is an open standard supported by many vendors, there is much more certainty that it will keep existing in the future than with any proprietary language.

If you look at the history of GPU-programming you’ll find many frameworks, such as BrookGPU, Close-to-Metal, Brook+, Rapidmind, CUDA and OpenCL. CUDA was the best choice from 2008 to 2013, as OpenCL had to catch up. Now that OpenCL is gaining serious market traction, the demand for porting legacy CUDA-code to OpenCL rises – as we clearly notice here.

We are very experienced in porting legacy CUDA code to all flavours of OpenCL (CPU, GPU, FPGA, embedded). Of course porting from OpenCL to CUDA is also possible, as is updating legacy CUDA code to the latest standards of CUDA 7.0 and later. We can also add several improvements to the architecture; we have made many customers happy by giving them more structured and documented code while working on the port. Want to see some work we did? We ported the molecular dynamics software Gromacs from CUDA to OpenCL.

Contact us today, to discuss your project in more detail. We have porting services for each budget.

[button text=”Request a pilot, code-review or more information” url=”https://streamhpc.com/consultancy/request-more-information/” color=”orange” target=”_blank”]

Xilinx FPGA

Not much news yet. This is on their webpage:

Software-based system realization with C/C++ and OpenCL

Xilinx is currently working with early customers on a new system level, heterogeneous parallel programming environment that leverage abstractions such as C/C++ and Open Computing Language (OpenCL®), in a comprehensive Eclipse-based development environment.

This environment provides market-specific libraries to significantly improve productivity of verified heterogeneous systems with Xilinx All Programmable devices and is architected to empower system architects, SW application developers, and embedded designers who require a parallel architecture, to increase system performance, BOM cost reductions and total power reduction with development time in line with ASSP, DSPs, and GPUs.

And a message that you should contact their sales team to learn more.

Basic concepts: malloc in the kernel

Pointers and allocated memory space with a hint to Oktoberfest.

During the last training I got the question of how to do a malloc in the kernel. It was one of those good questions, as it gives another view on a basic concept of OpenCL. Simply put: you cannot allocate (local or global) memory from within the kernel. Luckily it is possible, but it is somewhat hidden in another function, on the host.

clSetKernelArg to the rescue

The way to do it is from the host, using one of the kernel arguments.

cl_int clSetKernelArg(cl_kernel kernel,
                      cl_uint arg_index,
                      size_t arg_size,
                      const void *arg_value)

This function allocates the memory on the device for you. Just as with normal malloc, it doesn’t clear the memory for you.

To make sure the host cannot access the memory (and you don’t accidentally pin/write/read it when using host-generation scripts), you can use the flag CL_MEM_HOST_NO_ACCESS. All the flags have been explained in a previous article about this same function: setting flags for creating kernel arguments.

The advantage of only allowing malloc to be done from the host, before the kernel is launched, is that the memory planning can be done more efficiently.

Local memories

When you need a local space, you can specify that at the kernel-side. For example:

__kernel void foo(__local int* bar) { ... }

This reserves an area in each work-group’s local memory, with the size specified by arg_size.
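Putting the two cases together in host code, as a sketch only: `context` and `kernel` are assumed to already exist, error handling is omitted, and the kernel signature is our made-up example `__kernel void foo(__global int *data, __local int *scratch)`.

```c
/* Global memory: the host allocates a buffer object on the device.
 * CL_MEM_HOST_NO_ACCESS tells the runtime the host will never map,
 * read or write it. */
cl_mem data = clCreateBuffer(context,
                             CL_MEM_READ_WRITE | CL_MEM_HOST_NO_ACCESS,
                             1024 * sizeof(cl_int), NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &data);

/* Local memory: no cl_mem object at all. Pass the desired size and a
 * NULL arg_value, and every work-group gets its own scratch area. */
clSetKernelArg(kernel, 1, 256 * sizeof(cl_int), NULL);
```

The NULL arg_value is what marks the argument as local memory; the runtime does the actual allocation per work-group when the kernel is launched.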

Basic Concepts

This short article is part of the basic-concepts series. It covers several subjects I have not seen explained well enough in books or the reference manual. If there is a subject you would like to see in this series, just contact us.

Visit us (Amsterdam)

For the Budapest office, please contact us to make an appointment.

The Amsterdam Stream HPC offices are located on the sixth floor of Koningin Wilhelminaplein 1 in Amsterdam, which is at the Amsterdam West Port business area. Below you’ll find information on how to get there, what is in the surroundings and which hotels are used by ourselves and guests.

The entry of the office building

Getting to Koningin Wilhelminaplein 1

By Car

The office is located near the ring road A10, which makes the location easily accessible by car, via exit S107.

From the ring road A10 the complete Dutch motorway network is accessible. Taking the A10 to the South often results in a traffic jam though. See https://www.anwb.nl/verkeer for up-to-date traffic info.

Parking in the parking garage is only available if you let us know in advance! There is a ParkBee at 5 minutes’ walking distance – there is always more than enough space. It costs at most €10 per day when using the Yellowbrick app or when reserved via ParkBee, and about €20 per day when paid at the location. Please agree in advance on who pays this.

Route – travel time (outside rush hours):

  • Office – Schiphol: 15 minutes
  • Office – The Hague: 40 minutes
  • Office – Utrecht: 35 minutes
  • Office – Rotterdam: 50 minutes

By Public transport

The office is a 5 minute walk from Amsterdam Lelylaan. See further below for the walking route.

View in the direction of the office from the metro station

In Amsterdam, Lelylaan station is a medium-sized public-transport hub. It should be easy to get here from any big city or any address in Amsterdam, as many fast trains also stop here.

  • Trains to the North: Amsterdam Central, Haarlem, North and East of the Netherlands
  • Trains to the South: Schiphol, Amsterdam Zuid, Amsterdam RAI, Utrecht, Eindhoven, Leiden and Rotterdam
  • Bus: Lines 62 (Amstel), 63 (Osdorp), 195 (Schiphol).
  • Metro: Line 50 connecting to Amsterdam train-stations Sloterdijk, Zuid, RAI and Bullewijk. In case there are problems with the train to Lelylaan/Sloterdijk, one option is to go to Amsterdam Zuid and take the metro from there. Line 51 connects to Vrije University in Amsterdam Zuid.
  • Tram: Lines 1 (Osdorp – Muiderpoort) and 17 (Osdorp – Central station).

See https://9292.nl/station-amsterdam-lelylaan for all time tables and planning trips.

Walking from the train/metro station

Remember that in the Netherlands crossing car lanes is relatively safer than crossing bike lanes, contrary to traffic in other countries. In Dutch cities cars brake when you cross the street, while bikes simply don’t. No joke. So be sure not to walk on the red bike paths unless really necessary.

When leaving the train station, make sure you take the Schipluidenlaan exit towards the south (to the right, when you see the view as in the image). This is where the buses are, not the trams. If you are at the tram area (between two car roads), go back through the station area.

Near the bus stop, walk to the roundabout to the west. Follow the street all the way to the next roundabout, where you will see the shiny office building on your right.

By Taxi

In Amsterdam you can order a taxi via +31-20-6777777 (+31-20, then 6 times 7). Expect a minimum charge of €20.

At Schiphol Airport there are official taxi stands – it will take 15–25 minutes to get to Lelylaan outside rush hours. Make sure to tell the driver about the roundabout reconstruction, to prevent a 10-minute-longer drive.

Bolt and Uber both operate in Amsterdam. Best to pick Bolt, if you have both.

Bicycle

For biking, use https://www.route.nl/routeplanner with “Rembrandtpark” as the end point for the better/nicer/faster routes. From the park it is very quick to get to the office – use a normal maps app for the final stretch.

Inside

When entering through the front door, go to the right to find the elevators and take one to the 6th floor. From the elevators, head north. You’ll see our sign!

The entry of the office
Stream HPC has the blue marked office.

Surroundings

The park is a few hundred metres to the north-west, under the bridge. A supermarket (Albert Heijn) is a hundred metres to the east, halfway to the metro and train station.

Hotels

These Hotels are nearby:

A bit further away:

Hotels closer to the centre require a taxi (or a bike), but due to traffic it is best not to use these. We will add more hotels once we have checked them ourselves and have not heard negative experiences about them.

GPUDirect and DirectGMA – direct GPU-GPU communication via RDMA

Wrong!
Contrary to what you see around (on slides like these), AMD and Intel also have support for RDMA.

A while ago I found the slide on the right, claiming that AMD did not have any direct GPU-GPU communication. I found from several sources that it does exist, but it seems not to be a well-known feature. The feature is known as SDI (mostly on network cards, SSDs and FPGAs), but not much information can be found on PCI+SDI. More often RDMA is used: Remote Direct Memory Access (wikipedia).

Questions I try to answer:

  • Which server-grade GPUs support direct GPU-GPU communication when using OpenCL?
  • What are other characteristics interesting for OpenCL-devs besides direct communication GPU-GPU, GPU-FPGA, GPU-NIC?
  • How do you code such fast communication?

Enjoy reading! Continue reading “GPUDirect and DirectGMA – direct GPU-GPU communication via RDMA”

Kalray MPPA

Kalray has one recent processor, the MPPA-256. The design is somewhat comparable to the Parallella processor, but instead of 16 or 64 cores it has a staggering 256, while operating at a few Watts. The architectures are not completely comparable, but the two share important characteristics. The MPPA-256 has its strongest positioning in task-parallel programming, which makes it capable of more complex algorithms like video encoding.

During HiPEAC’14 they announced they were developing an OpenCL compiler, targeted for the end of 2014. Meanwhile more information has reached us, and the project seems to be in very active development. Due to the MPPA’s characteristics, the focus is on the task-parallel programming model of OpenCL – we have not given this programming model much attention on the blog, but now we certainly will.

Until Q1 2015 the SDK is available via a closed beta program. Please contact Kalray for more information on the MPPA processor and the OpenCL compiler.

Porting code that uses random numbers

Dice

When we port software to the GPU or FPGA, testability is very important. Part of making code testable is getting its functionality fully under control. And you guessed it already: run-time generated random numbers need close attention.

In a selection of past projects, random numbers were generated anew on every run. Statistically the simulations were more correct, but it was impossible to make 100% sure the ported code was functionally correct. This is because two variations are introduced at once: one due to the numbers being different, and one due to differences in code and hardware.

Even if the combined error variations are within the given limits, the two code bases can have unnoticed, different functionality. On top of that, it is hard to keep further optimisations under control, as these can lower the precision.

When porting, the stochastic correctness of the simulations is less important. Predictable outcomes should be leading during the port.

Below are some tips we gave these customers; I hope they are useful for you too. If you have code to be ported, these preparations make the process quicker and more correct.

If you want to know more about the correctness of RNGs themselves, we discussed earlier this year that generating good random numbers on GPUs is not obvious.

Continue reading “Porting code that uses random numbers”

OpenCL in the cloud – API beta launching in a month

We’re starting the beta phase of our AMD FirePro based OpenCL cloud service in about a month, to test our API. If you need to have your OpenCL-based service online and don’t want to pay hundreds to thousands of euros for GPU hosting, then this is what you need. We have room for a few more participants.

The instances are chrooted, not virtualised. The API calls are protected, and potentially some extra calls have to be made to fully lock the GPU to your service. The connection is 100 Mbit full-duplex.

Payment is per usage: per second, per GPU and per MB of data – we will fine-tune the weights together with our first customers. The costs are capped, to make sure our service remains cheaper than comparable EC2 instances.
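To illustrate how such a usage-based, capped billing model works – the rates and cap below are entirely hypothetical placeholders, since the real weights were still being fine-tuned with the first customers:

```python
def usage_cost(seconds, megabytes, gpus,
               rate_per_second=0.0001, rate_per_mb=0.00001, cap=50.0):
    # Bill per second per GPU, plus per MB of data transferred.
    # The cap ensures the total never exceeds a fixed ceiling,
    # keeping the service cheaper than a flat-rate instance.
    raw = gpus * seconds * rate_per_second + megabytes * rate_per_mb
    return min(raw, cap)

# One GPU for an hour plus 100 MB of data stays well under the cap;
# a runaway job on four GPUs is simply clipped at the ceiling.
print(usage_cost(3600, 100, 1))
print(usage_cost(10**9, 0, 4))
```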

Get in contact today, if you are interested.

We more than halved the FPGA development time by using OpenCL

A flying FPGA board

Over the past year we developed and fine-tuned a project setup for FPGA development that is much faster than any other method, including other high-level languages for making FPGA-based systems.

How we did it

OpenCL makes it easy to use the CPU and GPU and their tools. Our CPU and GPU developers would design the software with FPGAs in mind, after which the FPGA developer took over and finalised the project. As we have expertise in the very different phases of such a project, we could be much more effective than when sticking to traditional methods.

The bonus

It also works on CPU and GPU. It has to be said that the code hasn’t been fully optimised for CPUs and GPUs – this can be done in a separate project. When a decision has to be made on which hardware to use, our solution carries the least risk and provides the most answers.

Our Unique Selling Points

For the FPGA market our USPs are clear:

  • We outperform traditional FPGA development companies in time-to-market and price.
  • We can discuss problems on the hardware level, software level and algorithm level. This contrasts with traditional FPGA houses, where there are fewer such bridges.
  • Our software also works on CPUs and GPUs for no additional charge.
  • The latencies of the resulting design are very comparable to those of traditionally developed FPGA solutions.

We’re confident we can make a difference in the FPGA market. If you want more information or want to discuss, feel free to contact us.

Machine Learning

Machine learning is increasingly employed in computing tasks where it is infeasible to design an explicit algorithm, due to the high dimensionality of the input space and the overall complexity of the problem. Machine learning algorithms build up a model from example inputs and continuously refine this model based on some form of feedback over many training steps. Learning is often either supervised or unsupervised, and in both cases it is very time-consuming. Using our expertise in parallel programming, we can speed up your machine learning algorithms, significantly increasing learning rates and thus the quality of your results. For example, we helped one of our customers by reducing the training times of their artificial neural network to a tenth of the original time, which translated into better quality of the customer’s analysis software.

We can also advise you on whether your algorithm is suitable for high speedups, or whether a different algorithm would benefit more from parallelisation. Contact us to find the best solution for you.

Performance Tuning

When your custom software doesn’t have the needed performance, it can often be fixed. After a performance assessment, we’ll flex our muscles to make your code as fast as needed.

Of course, the total costs are higher when fixing performance afterwards, so we suggest getting in contact early in the project if performance is a priority.

1-day Crash Course

Throughout Europe we give crash courses in OpenCL. After an investment of €500 ($600) and one day, you will know:

  • The models used to define OpenCL.
  • Whether OpenCL is an option for your project.
  • How to read and understand OpenCL code.
  • How to code simple OpenCL programs.
  • The differences between CPUs, GPUs and FPGAs.

Crash courses are intended to get you in contact with programming accelerators; they don’t replace a full training.

If you are interested in a crash course in a city or at a time other than shown below, fill in this form to get notified when a crash course is scheduled in your city of choice.

[eme_events category=4]

IWOCL 2017 Toronto call for talks and posters is open

The fifth International Workshop on OpenCL (IWOCL) will be held on 16-18 May 2017 in Toronto, Canada. The event kicks off with a full-day Advanced Hands-On OpenCL tutorial, which is followed by two days of conference: keynotes, academic papers, technical presentations, tutorials, poster sessions and table-top demonstrations.

The IWOCL 2017 call for submissions is now open – submit your abstract here. The deadline is at the beginning of February, so be sure to submit within the coming month!

The call for IWOCL 2017 annual sponsors is also open; for that, contact the IWOCL organisation via this webform.

Every year there have been unique conversations with real influence on the OpenCL standard, and various talks shared real-life development experience. If you miss truly technical talks at certain other GPU conferences, then IWOCL is where you should go.