SDKs

!!!THESE PAGES WILL BE MOVED TO OPENCL.ORG!!!

OpenCL is growing fast and various architectures now support compute acceleration. This means you have plenty of choice to find the right solution for your algorithm.


Working

Possibly in the (near) future

Currently we are looking into:

  • Game Consoles
    • Nintendo Wii U dev – only vague rumours.
    • Sony Playstation 4 Orbis – strong rumours.
  • Movidius – has internal builds, but will only release on customer’s request.
  • Texas Instruments – support on C66x multicore DSPs (PDF source) and on their ARM-chips.
  • ST-Ericsson
If you have more information, let us know.

Abandoned

Useful peripherals

When working with various devices, you might find the tips below useful.

ARM


When working with those small cute computers, three things come in handy:

  • An HDMI switch (or a monitor with multiple HDMI inputs).
  • A small keyboard+mouse combo that uses Bluetooth or only one USB port. I use the Logitech keyboard as shown at the right.
  • A network switch with enough free ports. Even though most boards have WiFi, a good wired internet connection proves itself to be valuable.

Intel’s answer to AMD and NVIDIA: the XEON Phi 5110P

NOTE: there are many contradicting sources out there, so there may be mistakes in this article. Please give me feedback via Twitter, mail or comments, so the info can be completed.

Yes, another post in the answer-to series. At SC12 Intel tries to steal away the show from the Tesla K20 and FirePro S10000.

After two years of waiting, Intel has finally come out with an accelerator card: the Xeon Phi. Compare it to NVIDIA having skipped the GTX 200 series and presenting the GTX 500 series now. Or maybe even the GTX 600 series – we cannot tell yet.

The Phi is not a compute card as we know it. Just as you cannot do a 1-to-1 comparison between AMD's GCN architecture and NVIDIA's Kepler, neither can easily be compared to the Phi. But this article should give an idea of where it is positioned.

Continue reading “Intel’s answer to AMD and NVIDIA: the XEON Phi 5110P”

The entanglement of Bitcoins and compute-capabilities

Every now and then I read stories on Bitcoins (Wikipedia-article), as GPUs are used a lot to “mine” Bitcoins. The community has some extensive benchmarks, and their discussions give me insights into specific parts of accelerators like GPUs. This group is also very forward when it comes to accepting new techniques. Today something changed: they are a bank now. One of the thoughts I had about this, I’d like to share with you.

If you look at various types of currencies, you see they all have various goals (trade, power, resources, energy, properties, etc). The inequality and differences are even more important than the amounts. Various currencies are entangled with a certain goal or resource, but nothing is strongly entangled with technology. This is where Bitcoins come in…

Bitcoins are entangled with compute-power – a current benchmark for technological progress.

In this article I’d like to share how the tech-economy and Bitcoins are entangled, seen from the perspective of computing. I left out a lot of the “rules of economy” and hope you can fill these in – the text below is just meant to guide you through the thought-process. Disagreement is only good, as we all learn from it.

Continue reading “The entanglement of Bitcoins and compute-capabilities”

Applied GPGPU-days Amsterdam 2013

December 2013: videos are not ready yet, but a link will be put here.

Amsterdam, 20 June – the Applied GPGPU-days. Keep your agenda free for this event.

What can you do with GPUs to speed up computations? This year we will show various examples where OpenCL and CUDA have been used. We hope to give you an answer as to whether you can use GPUs for your software, research or algorithm.

After the success of last year (fully booked with 66 attendees), we have now reserved a larger location with room for 100 people. The difference from last year is that we focus more on applications and less on technical aspects.

The program has been made public recently:

Title of talk | Company/Institute | Presenter
Introduction to GPGPU and GPU-architectures | StreamHPC | Vincent Hindriksen
Blender Cycles & Tiles: Enhancing user experience | AtMind bv | Monique Dewanchand & Jeroen Bakker
XeonPhi vs K20: The fight of the titans | SURFsara | Evghenii Gaburov
A real-time simulation technique for ship-ship and ship-port interaction | PMH bv | Jo Pinkster
CUDA Accelerated Neural Networks | LIACS | Ana Balevic
Efficient Reconstruction of Biological Networks via Transitive Reduction on GPUs | TU Eindhoven | Anton Wijs
Running Petsc on GPUs with an example from fluid dynamics | SURFsara | Thomas Geenen
Connected Component Labelling, an embarrassingly sequential algorithm | Leeuwarden University | Jaap van de Loosdrecht
Visualizing sound and vibrations using a GPU and a 1024-channel microphone array | TU Eindhoven | Wouter Ouwens
Gravitational N-body simulations on 1 to many GPUs | Leiden Observatory | Jeroen Bédorf

A few demos will be shown.

For more information, see the Platform Parallel webpage, where you can also find other events by the platform.

Tickets are €75,-. If you are from a Dutch university or research institute affiliated with SURF, your ticket has been fully sponsored by SURFsara.

Associated events in the Netherlands

For the technical aspects (GPU-programming techniques, optimisation, etc) we have a special day: the GPU Dev Day 2013. More information on the Platform Parallel webpage. Date and place will be made public in June.

The first Khronos Meetup Benelux will take place just before the Applied GPGPU day, on 19 June in Amsterdam. More information on the meetup-page.

First Khronos Chapter meeting in Amsterdam: WebGL/OpenGL


On Thursday 13 February 2014 the first Khronos meetup in Amsterdam will take place. We expect a small group, so the location will be cozy and there will be enough time to talk over a beer. The first round is on me; admission is free.

The goal is to learn about open media standards from Khronos and others. So when OpenCV is discussed, we’ll also talk OpenVX. The target group is programmers and indie developers who are interested in creating multi-OS and multi-device software.

Program

I am very thrilled to announce that Ton Roosendaal of the Blender Foundation will talk about the relationship between Blender and Khronos’ OpenGL.

Second, Maarten and Jurjen of ThreeDee Media will talk about WebGL, from both a technical and a market view. Is WebGL ready for prime time?

Then you can show your stuff. For that I’ll bring a good laptop with Windows 8.1 and Ubuntu 13.10 64-bit.

Prepare for the Meetup today!

See the Meetup-page for more information. See you there!

Building a 150 TFLOPS cluster with Accelerators in 2014

You can’t ignore accelerators when designing a new HPC cluster anymore. Back in 2010 I suggested using GPUs to enter the Top 500 with a budget of only €38k. It takes ten times that now, as almost everybody has started to use accelerators. Getting into the November Top 500 would roughly take a cluster of 150 TFLOPS.

I’d like to give you a list of what you can expect for 2014, and to help you design your HPC cluster with recent hardware. The focus should be on OpenCL-capable hardware, as open standards prepare you better for future upgrades. So, this is also a guess at what we will see in the November Top 500, based on current information.

There are currently professional solutions from NVIDIA, AMD, Intel and Altera. I’ve searched the web and asked around for what the upcoming offers would be. You will find the results below. But information should continue to flow: please add your remarks in the comments, so we get the best information through collaboration.

Comparison: only the double-precision GFLOPS of the accelerators are mentioned. The theoretical GFLOPS cannot be reached in real-world benchmarks; therefore DGEMM is used as an indication of the maximum realistic GFLOPS. The efficiencies of other benchmarks (like LINPACK) are all lower.
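For clarity, this is the simple arithmetic used in each overview below (my own summary, so verify against the vendor datasheets):

$$P_{\mathrm{DGEMM}} = \eta \cdot P_{\mathrm{peak}}, \qquad \mathrm{GFLOPS/Watt} = \frac{P_{\mathrm{DGEMM}}}{\mathrm{TDP}}, \qquad n_{\mathrm{accelerators}} = \left\lceil \frac{150~\mathrm{TFLOPS}}{P_{\mathrm{DGEMM}}} \right\rceil$$

where η is the DGEMM efficiency. For the K40 below, for example: 0.93 × 1.43 ≈ 1.33 TFLOPS, and 1330 GFLOPS / 235 Watt ≈ 5.65 GFLOPS/Watt.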

NVIDIA Tesla

NVIDIA Tesla is the current market leader with the Tesla K20 and K20X. At the end of 2013 they announced the K40 (GK110b architecture), which is 10% to 20% faster than the K20X (see table): 10% faster in max GFLOPS, plus up to another 10% from architecture improvements. It’s not a huge difference, but the new Maxwell architecture is more promising. The problem is that high-end Maxwell is not expected this year. There are several rumours about what’s going on, but the official word is that there are problems with 20nm. I’ve had this confirmed by different sources, but will, of course, keep you up-to-date on Twitter.

I could not find good enough information on the K40X. It has also been very quiet around the current architectures at their yearly GTC conference. My expectation is that they want to kick in hard with Maxwell in 2015, and that for 2014 they’ll focus on keeping their current customers happy in a different way. For now, let’s assume the K40X is 10% faster.

So, for this year it will be the K40. Here’s an overview:

  • Peak 1.43 DP TFLOPS theoretical
  • Peak 1.33 DP TFLOPS DGEMM (93% efficiency)
  • 5.65 GFLOPS/Watt DGEMM
  • Needs 122 GPUs to get 150 TFLOPS DGEMM
  • Lowest street price is $4800: $585,600 for 122 GPUs.

AMD FirePro

Just like the Tesla K40 and the Intel Xeon Phi, AMD offers accelerators with a lot of memory. The S10000 and S9000 are their current server offers, but these are still based on their older architectures. Their latest architecture is only available for gamers (e.g. the R9 290X) and workstations (e.g. the W9100). Now, with the recent announcement of the W9100, we have an indication of what a server version – the S9150 – would cost and look like. I expect this card to launch soon; I even expected it to launch before the W9100.

What is interesting about the W9100 is the high memory transfer rate and the large memory. Assuming they need to pack the S9150 into 225 Watt and won’t change the design much in order to launch soon, they need to under-clock it by about 22%. I think they could allow 235 Watt (like the K40), but I want to be realistic.

 | FirePro W9100 | FirePro W9000 | FirePro S9150
Shader count | 2816 | 2048 | 2816
Mem size | 16 GByte | 6 GByte | 16 GByte
Mem type | GDDR5 | GDDR5 | GDDR5
Interface | 512 bit | 384 bit | 512 bit
Transfer rate | 320 GByte/s | 264 GByte/s | 320 GByte/s
TDP | 275 Watt | 274 Watt | 225 Watt (-22%)
Connectors | 6 × MiniDP, 3D-Stereo, Frame-/Genlock | 6 × MiniDP, 3D-Stereo, Frame-/Genlock | ?
Multi-monitor | yes (6) | yes (6) | Don’t care
SP/DP (TFLOPS) | 5.24 / 2.62 | 3.99 / 1.0 | 4.1 / 2.0 (-22%)
ECC | yes | yes | yes
OpenCL 2.0 | yes | no | yes
Price | $3999 USD | $2999 USD | ?

So, what about the successor of the FirePro S9000 with the latest GCN architecture, the S9150? An overview:

  • Peak 2.0 DP TFLOPS theoretical
  • Peak 1.6 DP TFLOPS DGEMM (at 80% efficiency, to be safe)
  • 7.1 GFLOPS/Watt DGEMM
  • Needs 94 GPUs to get 150 TFLOPS DGEMM
  • No prices available yet – AMD mostly prices lower than NVIDIA. $375,906 for 94 GPUs, when priced at $3999.

Update: a DGEMM efficiency of 90% has been reached. That gives 1.8 DP TFLOPS DGEMM and 8.0 GFLOPS/Watt DGEMM. As a result, you need only 84 GPUs to get to the 150 TFLOPS.

Intel Xeon Phi

Intel currently offers the 3110, 5110 and 7110 Xeon Phis. In the past months they added the 3120, 5120 and 7120. The 7120 uses 300 Watt, which needs a special case to cool this passively cooled card – I don’t quite understand this choice. I could compare it to the W9100 and a heavily overclocked K40, or use lower numbers like I did above with the FirePro. But, as you can see, even at 300 Watt it doesn’t compare well.

The OpenCL drivers have been improved this year, which is more promising news. The guess here is whether they will launch a new 7130, a 7200, or none at all. All the news and rumours speak of 2015 and 2016, bringing more integrated memory and a socket version(!) of the Xeon Phi.

For this year the Xeon Phi 7120 would be their top offer. It compares well with AMD’s W9100 when it comes to memory: 16GB GDDR5 and 352 GB/s.

  • Peak 1.21 DP TFLOPS theoretical
  • Peak 1.07 DP TFLOPS DGEMM (at 80% efficiency)
  • 3.56 GFLOPS/Watt DGEMM
  • Needs 140 Phi’s to get 150 TFLOPS DGEMM
  • Costs $4129 officially, $578,060 for 140.

Altera FPGAs

With OpenCL it finally became possible to run SIMD-focused software on FPGAs. OpenCL 2.0 also has some improvements for FPGAs, making it interesting for mature software that needs low latency or low power usage. In other words: software that has been designed on GPUs, where measurements show that lower latency would out-compete the GPU-using competition, or where the electricity bill makes the CFO sad. Understand that FPGAs do compete with the above three, but they have their own performance hot spots and are therefore hard to compare.

I don’t expect a big entry in this year’s Top 500, but I’m watching FPGA progress closely. Xilinx is also entering this market, but I don’t get much response (if any) to the emails I send them. For next year’s article I hope to include FPGAs as a true competitor. If you need low power or low latency, you’d better take the time to research the FPGA potential for your business this year.

Conclusion

Open standards

For those who don’t know, I tend to prefer open standards. The main reason is that switching hardware is easier; it gives you space to experiment. AMD, Intel and Altera support OpenCL 1.2 and will start with 2.0 later this year, whereas NVIDIA lags over 2 years behind and only supports OpenCL 1.1. The results are now very visible: due to the problems with Maxwell, you’ll need to postpone your plans to 2015 if you code in CUDA. There is one way to pressure them, though: port your code to OpenCL, buy Intel or AMD hardware, and then let NVIDIA know you want this flexibility.

Green 500

You might have noticed the big differences in GFLOPS/Watt. Where this matters is in the Green 500, the list of energy-efficient supercomputers. The goal of today’s supercomputers is to be mentioned in the top 10 of both lists. If you build an efficient cluster (say 2 CPUs + 4 GPUs), you can get to 70-80% of the max DGEMM performance. Below is the list at 75%:

  • AMD FirePro – 7.10 GFLOPS/Watt DGEMM -> 5.33 GFLOPS/Watt @ 75%
  • NVIDIA Tesla – 5.65 GFLOPS/Watt DGEMM -> 4.24 GFLOPS/Watt @ 75%
  • Intel XeonPhi – 3.56 GFLOPS/Watt DGEMM -> 2.67 GFLOPS/Watt @ 75%

Currently this list is led by a cluster with K20X GPUs, squeezing out 4.50 GFLOPS/Watt, which is even 86% of the max DGEMM.

In other words: if the FirePro gets out in time, the Green 500 could be full of FirePro GPUs.

Update November 2014: here is the Green top 5.

Green500 with AMD FirePro S9150 at spot #1

The winner

Since there are only three offers, they are all winners. What matters is the order.

  1. AMD FirePro – with 16GB of fast memory, the clear winner in DGEMM performance. The negative side: CUDA software needs to be ported to OpenCL (we can do that for you).
  2. NVIDIA Tesla – Second to everything from FirePro (bandwidth, memory size, GFLOPS, price). The negative side: its OpenCL-support is outdated.
  3. Intel XeonPhi – Same as FirePro when it comes to memory. Nevertheless, it’s 60% slower in DGEMM and 50% less efficient. The negative side: 300 Watt for a server.

I am happy to see AMD as a clear winner after years of NVIDIA leading the pack. As AMD is the most prominent supporter of OpenCL, this could seriously democratise HPC in times to come.

Need to port CUDA to extremely fast OpenCL? Hire us!

If you order a cluster from AMD instead of NVIDIA, you effectively get our services for free.

Valgrind suppression file for AMD64 on Linux

Valgrind is a great tool for finding possible memory leaks in code written in C, C++, Java, Perl, Python, assembly, Fortran, Ada, etc. I use it to check whether provided code is OK before I start porting it to GPU code – it finds those devils in the details. It has also given me good feedback when hunting my own bugs while writing OpenCL code. Unfortunately it does not work well with optimised libraries, such as the OpenCL driver from AMD.

You’ll get errors like the ones below, which clutter the output.

==21436== Conditional jump or move depends on uninitialised value(s)
==21436==    at 0x6993DF2: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x6C00F92: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x6BF76E5: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x6C048EA: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x6BED941: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x69550D3: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x69A6AA2: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x69A6AEE: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x69A9D07: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x68C5A53: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x68C8D41: ??? (in /usr/lib/fglrx/libamdocl64.so)
==21436==    by 0x68C8FB5: ??? (in /usr/lib/fglrx/libamdocl64.so)

How to fix this cluttering? Continue reading “Valgrind suppression file for AMD64 on Linux”
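As a teaser of where the article is going: Valgrind reads suppression files that filter out such known-noisy reports. A minimal entry for the errors above could look like this (the entry name is arbitrary, and "..." matches any deeper stack frames):

{
   amdocl64_uninitialised_cond
   Memcheck:Cond
   obj:/usr/lib/fglrx/libamdocl64.so
   ...
}

Save entries like this in a file (e.g. amd.supp) and run valgrind --suppressions=amd.supp ./your-app. Valgrind can even generate candidate entries for you with --gen-suppressions=all.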

Market Positioning of Graphics and Compute solutions

When compute became possible on GPUs, it was first presented as an extra feature and did not change the positioning of the products by AMD/ATI and NVIDIA much. NVIDIA started positioning server-compute (described as “the GPU without a monitor-connector”), and AMD and Intel followed. When the expensive GeForce GTX Titan and Titan Z were introduced, it became clear that NVIDIA still thinks about positioning: Titan is the bridge between GeForce and Tesla, a Tesla with video-out.

Why is positioning important? It is the difference between “I’d like to buy a compute card for my desktop, so I can develop algorithms that also run on the compute server” and “I’d like to buy a graphics card for doing computations and later run that on a passively cooled graphics card”. The second request might get a “you don’t want to do that”, as graphics terminology is being used to refer to compute goals.

Let’s get to the overview.

 | AMD | NVIDIA | Intel | ARM
Desktop User * | A-series APU | – | Iris / Iris Pro | –
Laptop User * | A-series APU | – | Iris / Iris Pro | –
Mobile User | – | Tegra | Iris | Mali T720 / T4xx
Desktop Gamer | Radeon | GeForce | – | –
Laptop Gamer | Radeon M | GeForce M | – | –
Mobile High-end | – | Tegra K (?) | Iris Pro | Mali T760 / T6xx
Desktop Graphics | FirePro W | Quadro | – | –
Laptop Graphics | FirePro M | Quadro M | – | –
Desktop (DP) Compute | FirePro W | Titan (HDMI) / Tesla (no video-out) | XeonPhi | –
Laptop (DP) Compute | FirePro M | Quadro M | XeonPhi | –
Server (DP) Compute | FirePro S | Tesla | XeonPhi (active cooling!) | –
Cloud | Sky | Grid | – | –

* = For people who say “I think my computer doesn’t have a GPU”.

My thoughts are that Titan is meant to promote compute at the desktop, while Tesla is also promoted for that. AMD has the FirePro W for that purpose, serving both graphics professionals and compute professionals. Intel uses the XeonPhi for anything compute, and it is all actively cooled.

The table has some empty spots: NVIDIA doesn’t have an IGP, AMD doesn’t have mobile graphics, and Intel doesn’t have a clear message at all (J, N, X, P and K mixed across all types of markets). Mobile-GPU vendors such as ARM, Imagination and Qualcomm have a clear message differentiating between high-end and low-end mobile GPUs, whereas NVIDIA and Intel don’t.

Positioning of the Titan Z

Even though I think that NVIDIA made the right move by positioning a GPU for the serious compute hobbyist, their proposition is very unclear. AMD is very clear: “Want professional graphics and compute (and play games after work)? Get FirePro W for workstations”, whereas NVIDIA says “Want compute? Get a Titan if you want video-output, or a Tesla if you don’t”.

See this GeForce-page, where they position it as a gamer’s card that competes with the Google Brain supercomputer and a Mac Pro. In other places (especially benchmarks) it is stressed that it is not meant for gamers, but for compute enthusiasts (who can afford it). See for example this review on Hardware.info:

That said, we wouldn’t recommend this product to gamers anyway: two Nvidia GeForce GTX 780 Ti or AMD Radeon R9 290X cards offer roughly similar performance for only a fraction of the money. Only two Titan-Zs in SLI offer significantly higher performance, but the required investment is incredibly high, to the point where we wouldn’t even consider these cards for our Ultimate PC Advice.

As a result, Nvidia stresses that these cards are primarily intended for GPGPU applications in workstations. However, when looking at these benchmarks, we again fail to see a convincing image that justifies the price of these cards.

So NVIDIA’s naming convention is unclear. If TITAN is for the serious and professional compute developer, why use the brand “Geforce”? A Quadro Titan would have made much more sense. Or even “Tesla Workstation”, so developers could get a guarantee that the code would run on the server too.

Differentiating from low-end compute

Radeon and GeForce GPUs are used for low-cost compute clusters. Both AMD and NVIDIA prefer to sell their professional cards for that market, and have difficulty making clear that game cards are not designed for compute-only solutions. The one thing they did in the past years is reserve good double-precision performance for their professional cards only. An existing difference was the driver quality between Quadro/FirePro (industry quality) and GeForce/Radeon. I think both companies have to rethink the differentiated driver strategy, as compute has changed the demands in the market.

I expect more differences in the support software for the different types of users. When would I pay for professional cards?

  1. Double Precision GFLOPS
  2. Hardware differences (ECC, NVIDIA GPUDirect or AMD SDI-link/DirectGMA, faster buses, etc)
  3. Faster support
  4. (Free) Developer Tools
  5. System Configuration Software (click-click and compute works)
  6. Ease of porting algorithms to servers/clusters (up-scaling with less bugs)
  7. Ease of porting algorithms to game-cards (simulation-mode for several game-cards)

So the list starts with hardware-specific demands, then shifts to developer support. Let me know in the comments why you would (or would not) pay for professional cards.

Evolving from gamer-compute to server-compute

GPU developers are not born, but made (trained or self-educated). Most of the time they start with OpenCL (or CUDA) on their own PC or laptop.

With NVIDIA it would be hobby-compute on GeForce, then serious stuff on Titan, then Tesla or Grid. AMD has a comparable growth path: hobby-compute on Radeon, then an upgrade to FirePro W and then to FirePro S or Sky. With Intel it is Iris or XeonPhi directly, as their positioning is not clear at all when it comes to accelerators.

Conclusion

The positioning of graphics cards and compute cards is finally getting settled at the high level, but will certainly change a few more times in the year(s) to come. Think of the growing market of home-video editors in 2015, who will probably need a compute card for video compression. NVIDIA will come up with a different solution than AMD or Intel, as it has no desktop CPU.

Do you think it will be possible to have an AMD APU with an NVIDIA accelerator? Do people need to buy an accelerator-box in 2015 that can be attached to their laptop or tablet via network or USB, to do the rendering and other compute-intensive work (a “private compute cloud”)? Or will there always be a market for discrete GPUs? Time will tell.

Thanks for reading. I hope the table makes clear how things are now as of 2014. Suggestions are welcome.

We more than halved the FPGA development time by using OpenCL

A flying FPGA board

Over the past year we developed and fine-tuned a project setup for FPGA development that is much faster than any other method, including other high-level languages for making FPGA-based systems.

How we did it

OpenCL makes it easy to use the CPU and GPU and their tools. Our CPU and GPU developers would design the software with FPGAs in mind, after which the FPGA developer took over and finalised the project. As we have expertise in the very different phases of such a project, we could be much more effective than by sticking to traditional methods.

The bonus

It also works on CPUs and GPUs. It has to be said that the code hasn’t been fully optimised for CPUs and GPUs – this can be done in a separate project. When a decision has to be made on which hardware to use, our solution carries the least risk and the most answers.

Our Unique Selling Points

For the FPGA market our USPs are clear:

  • We outperform traditional FPGA development companies in time-to-market and price.
  • We can discuss problems on the hardware level, software level and algorithm level. This contrasts with traditional FPGA houses, where there are fewer bridges between these levels.
  • Our software also works on CPUs and GPUs for no additional charge.
  • The latencies of the resulting project are very comparable to those of traditionally developed FPGA designs.

We’re confident we can make a difference in the FPGA market. If you want more information or want to discuss, feel free to contact us.

Image Processing

At StreamHPC, there is broad experience in the parallel, high-performance implementation of image filters. We have significantly improved the performance of various image processing software. For example, we have supported Pixelmator in achieving outstanding processing speeds on large image data, and users frequently praise the software’s speed in comparisons with competing software products.

StreamHPC is currently hosting an educational initiative that supports interested individuals in their efforts of porting algorithms from the open-source GEGL image processing framework to fast parallel versions based on OpenCL. GEGL is used by the popular image manipulation software Gimp as well as other free software. For more information on this project, look at our website OpenCL.org, which we dedicate to spreading knowledge on OpenCL.

Computer Vision

Computing demands in computer vision are high, and real-time processing with low latency is often desirable. Computer vision can greatly benefit from parallelization, as higher processing speeds can improve object-recognition rates, while FPGA solutions may reduce energy demands or support the perception of lag-free processing. At StreamHPC, we have supported several customers in optimizing their software to run on a lower power budget and at a higher speed. We can support you with dedicated solutions based on GPUs or FPGAs to meet your demands.

Performance can be measured as Throughput, Latency or Processor Utilisation

Getting data from one point to another can be measured in throughput and latency.

When you ask how fast code is, we might not be able to answer that question: it depends on the data and the metric.

In this article I’ll give an overview of different ways to describe speed and which metrics are used. I focus on two types of data utilisation:

  • Transfers: data movement through cables, interconnects, etc.
  • Processors: data processing, with data in and data out.

Both are important for selecting the right hardware. When we help our customers select the best hardware for their software, an important part of the advice is based on these metrics.

Transfer utilisation: Throughput

How many bytes get processed per second, minute or hour? Often the metric is GB/s, but even MB/day is possible. Alternatively, items per second is used when relative speed is discussed. A related term is bandwidth, which describes the theoretical maximum instead of the actual bytes being transported.

The typical type of software is a batch-process – think media-processing (audio, video, images), search-jobs and neural networks.

It could be that all answers are computed at the end of the batch process, or that results are given continuously. The throughput is the same, but the so-called latency is very different.

Transfer utilisation: Latency

What is the time between offering the data and getting the results? Or: what is the reaction time? It is measured in units of time, often nanoseconds (ns, a billionth of a second), microseconds (μs, a millionth of a second) or milliseconds (ms, a thousandth of a second). When latency gets longer than seconds, it is still called latency, but more often it’s called “processing time”.

This is important in streaming applications – think of applications in broadcasting and networking.

There are three causes for latency:

  1. Reaction time: hardware/software noticing there is a job
  2. Transport time: it takes time to copy data, especially when we talk GBs
  3. Process time: computing the data takes time as well
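In a simple model (my own formulation of the list above), these three simply add up:

$$t_{\mathrm{latency}} = t_{\mathrm{reaction}} + t_{\mathrm{transport}} + t_{\mathrm{process}}$$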

When latency is most important, we use FPGAs (see this short presentation on OpenCL-on-FPGAs) or CPUs with embedded GPUs (where the total latency of context-switching from and to the GPU is a lot lower than when discrete GPUs are used).

Processor utilisation: Throughput

Given the current algorithm, how much potential is left on the given hardware?

The algorithm running on the processor is possibly the bottleneck of the system. The metric we use for this balance is “FLOPS per byte”, also known as arithmetic intensity. The less data is needed per compute operation, the higher the chance that the algorithm is compute-limited. FYI: unless your algorithm is very inefficient, you should be very happy when you’re compute-limited.
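As an illustration (my own example, not from the original article), take the SAXPY operation, whose FLOPS-per-byte ratio is easy to count by hand:

#include <stddef.h>

/* SAXPY: y[i] = a * x[i] + y[i]
 * Per element: 2 FLOPs (one multiply, one add).
 * Memory traffic per element with 4-byte floats:
 *   load x[i] (4 B) + load y[i] (4 B) + store y[i] (4 B) = 12 B.
 * Arithmetic intensity: 2 / 12 ≈ 0.17 FLOPS per byte - far below the
 * ~4 FLOPS per byte mentioned below, so SAXPY is bandwidth-limited on
 * practically every processor. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}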


The image below shows how the above algorithms map onto the roofline model. You can see that on many processors you need at least 4 FLOPS per byte to hit the frequency-wall; otherwise you’ll hit the bandwidth-wall.

[Figure: the roofline model]

This is why HBM is so important.
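For reference, the roofline model itself fits in a single formula (standard in the literature): the attainable performance of an algorithm with arithmetic intensity I, on a machine with peak compute P_peak and peak memory bandwidth B, is

$$P_{\mathrm{attainable}} = \min(P_{\mathrm{peak}},\ I \cdot B)$$

For example, a device with 400 GFLOPS peak compute and 100 GB/s bandwidth has its ridge point at 400 / 100 = 4 FLOPS per byte: below that intensity you are limited by the bandwidth-wall, above it by the frequency-wall.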

Processor utilisation: Latency

How fast can data get in and out of the processor? This sets the minimum latency that can be reached. The metric is the same as for transfers (time), but at the system level.

For FPGAs this latency can be very low (tens of nanoseconds) when the data cables are directly connected to the FPGA chip. Such FPGAs sit on a board with e.g. a network port and/or a DisplayPort port.

GPUs depend on how well they’re connected to the CPU. As this is a subject of its own, I’ll discuss it in another post.

Determining the theoretical speed of a system

A request like “make this software as fast as possible” is a lot easier (and cheaper) to fulfil than “make this software as fast as possible on hardware X”. This is because there is no single fastest hardware (even though vendors would make us believe so); there is only hardware that is most optimal for a specific algorithm.

When doing code-reviews, we offer free advice on which hardware is best for the target algorithm, for the given budget and required power-envelope. Contact us today to access our knowledge.

What is Khronos as of today?

The Khronos Group is the organization behind APIs like OpenGL, Vulkan and OpenCL. Over one hundred companies are members, and together they decide what next year’s phone, camera, computer or media device will be capable of.


We work mostly with OpenCL, but you have probably noticed we work with OpenGL, Vulkan and SPIR too. Currently Khronos has the following APIs:

  • COLLADA, a file-format intended to facilitate interchange of 3D assets
  • EGL, an interface between Khronos rendering APIs such as OpenGL ES or OpenVG and the underlying native platform window system
  • glTF, a file format specification for 3D scenes and models
  • OpenCL, a cross-platform computation API
  • OpenGL, a cross-platform computer graphics API
  • OpenGL ES, a derivative of OpenGL for use on mobile and embedded systems, such as cell phones, portable gaming devices, and more
  • OpenGL SC, a safety critical profile of OpenGL ES designed to meet the needs of the safety-critical market
  • OpenKCam, Advanced Camera Control API
  • OpenKODE, an API for providing abstracted, portable access to operating system resources such as file systems, networks and math libraries
  • OpenMAX, a layered set of three programming interfaces of various abstraction levels, providing access to multimedia functionality
  • OpenML, an API for capturing, transporting, processing, displaying, and synchronizing digital media
  • OpenSL ES, an audio API tuned for embedded systems, standardizing access to features such as 3D positional audio and MIDI playback
  • OpenVG, an API for accelerating processing of 2D vector graphics
  • OpenVX, Hardware acceleration API for Computer Vision applications and libraries
  • OpenWF, APIs for 2D graphics composition and display control
  • OpenXR, an open and royalty-free standard for virtual reality and augmented reality applications and devices
  • SPIR, an intermediate compiler target for OpenCL and Vulkan
  • StreamInput, an API for consistently handling input devices
  • Vulkan, a low-overhead computer graphics API
  • WebCL, a JavaScript binding to OpenCL within a browser
  • WebGL, a JavaScript binding to OpenGL ES within a browser on any platform supporting the OpenGL or OpenGL ES graphics standards

Too few people understand how unique the organization is: the biggest processor vendors discuss collaborations and how to move the market, while normally they are the fiercest competitors. Without Khronos it would have been a totally different world.

Improving FinanceBench

If you’re into computational finance, you might have heard of FinanceBench.

It’s a benchmark developed at the University of Delaware, aimed at those who work with financial code, to see how certain code paths can be targeted for accelerators. It utilizes the original QuantLib software framework and samples to port four existing applications for quantitative finance. It contains codes for the Black-Scholes, Monte-Carlo, Bonds and Repo financial applications, which can be run on the CPU and GPU.
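For context, the Black-Scholes part evaluates the standard closed-form price of a European call option (the textbook formula, not code taken from FinanceBench itself):

$$C = S\,N(d_1) - K e^{-rT} N(d_2), \qquad d_1 = \frac{\ln(S/K) + (r + \sigma^2/2)\,T}{\sigma\sqrt{T}}, \qquad d_2 = d_1 - \sigma\sqrt{T}$$

where S is the spot price, K the strike, r the risk-free rate, σ the volatility, T the time to maturity and N the standard normal CDF. Pricing millions of independent options with this formula is embarrassingly parallel, which is exactly why it suits GPUs so well.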

The problem is that it has not been maintained for 5 years, and there were good opportunities for improvement. Even though the paper has already been written, we think the benchmark can still be of good use within computational finance. As we were seeking a way to make a demo for the financial industry that is not behind an NDA, this looked like the perfect starting point. We emailed all the authors of the library, but unfortunately did not get any reply. As the code is provided under a permissive license, we could luckily go forward.

The first version of the code will be released on Github early next month. Below we discuss some design choices and preliminary results.

Continue reading “Improving FinanceBench”

Hello

Welcome to the webpage of Stream HPC. We’re a company in Europe that works on solving the most difficult HPC problems, with an emphasis on scaling to GPUs and clusters. We have built up experience in speeding up software, designing performance-oriented architectures, writing maintainable low-level code, selecting the best hardware for the job, and building benchmarks. Above all, we’re a customer-oriented company: we want our clients to feel in control while we do the heavy lifting.

The company is multicultural and designed to be a safe space for everybody on our team – from LGBT+ to Asperger’s, we focus on making our differences our strengths. As you can read in the job self-assessment, we have 4 main strengths:

  • CPU development: algorithms, low-level code, architectures for CPU-based software. This includes clusters.
  • GPU development: algorithms, low-level code, architectures for GPU-based software. This includes graphics programming.
  • Problem-solving: get from full understanding to full exploration quickly.
  • Self-managed teams: we don’t hire managers, but provide frameworks.

Our customers are all around the world, especially in North America, Western Europe and East Asia. We have built a lot of high-performance software that runs on anything from edge computers to supercomputers. See “What we do” for examples.

Our offices are in:

  • Amsterdam
  • Budapest
  • Barcelona

If you want to know more, feel free to get in contact.

See this page for the Netherlands/Belgium, Hungary or Spain.

Let’s enter the Top500 HPC list using GPUs

The #500 supercomputer has only 24 TFLOPS (2010-06-06): http://www.top500.org/system/9677

Update: scroll down to see the best configuration I have found. In other words: a cluster of at least 30 nodes with 4 high-end GPUs each (costing almost €2000,- per node and giving roughly 5 TFLOPS single precision, 1 TFLOPS double precision) would enter the Top 500: 25 nodes to get to a theoretical 25 TFLOPS, plus 5 extra to overcome the overhead. So for about €60 000,- of hardware anyone can be on the list (add at least €13 000,- if you want to use Windows instead of Linux for some reason). OK, you pay most for the services and the actual building when buying such a cluster, but you get the idea: it does not cost you a few millions any more.

I’m curious: who is building these kinds of clusters? Could you tell me the specs (theoretical TFLOPS, LINPACK TFLOPS and Watts/TFLOP) of your (theoretical) cluster that costs the customer less than €100 000,- in total? Or do you know companies who can do this? I’ll make a list of companies who will be building the clusters of tomorrow, the “Top €100.000,- HPC cluster list”. You can mail me via vincent [at] this domain, or put your answer in a comment.

Update: the hardware shopping-list

Nobody claimed in the comments that it is easy to build a faster machine than the one described above, so I’ll do it myself. We want the most FLOPS per box, so here’s the wish list:

  • A motherboard with as many slots as possible for PCIe, CPU sockets and memory banks. This is because the lag between nodes is high.
  • A CPU with at least 4 cores.
  • Focus on bandwidth, or else we will not be able to use all the compute power.
  • Focus on price per GFLOPS.

The following is what I found in local computer stores (where, for some reason, people love to talk about extreme machines). AMD currently has the graphics cards with the most double-precision power, so I chose their products. I’m looking around for Intel + NVIDIA, but currently they are far behind. Is AMD back on stage after being beaten by Intel’s Core products for so many years?

The GigaByte GA-890FXA-UD7 (€245,-) has 1 AM3 socket, 6(!) PCIe slots and supports up to 16GB of memory. We want some power, so we use the AMD Phenom II X6 1090T (€289,-), which I chose for its 6 cores and low price per FLOPS. And to make it a monster, we add 6 AMD HD5970s (€599,- each), giving 6 × 928 = 5568 DP-GFLOPS. The board can handle 16GB of DDR3 (€750,-), so we put it in. It needs about 3 power supplies of 700 Watt (€100,- each). We add a 128GB SSD (€350,-) for working data and a big 2 TB HDD (€100,-). The case needs to house the 3 power supplies (€100,-). Cooling is important, and I suggest you compete with a wind-tunnel (€500,-). It will cost you €6228,- for 5.6 double-precision TFLOPS and 27 single-precision TFLOPS. A cluster of these would be on the Top 500 for around €38 000,- (pure hardware price, not taking network devices too much into account, nor the price of man-hours).

Disclaimer: this is the price of a single node, excluding services, maintenance, software installation, networking, engineering, etc. The above price is purely for building a single node yourself, if you have the knowledge to do so.

MPI in terms of OpenCL

OpenCL is a member of a family of host-kernel programming language extensions; others are CUDA, IMPC and DirectCompute/AMP. It is defined by a separate function or set of functions, referred to as kernels, which are prepared and launched by the host to run in parallel. Added to that are deeply integrated language extensions for vectors, which give an extra dimension to parallelism.

Apart from the vectors, there is much overlap between host-kernel languages and parallel standards like MPI and OpenMP. As MPI and OpenMP have focused on how to make software parallel for years now, this could give you an image of how OpenCL (and the rest of the family) will evolve. It also answers how MPI’s main concept, message-passing, could be done with OpenCL, and moreover how OpenCL could be integrated into MPI/OpenMP.

At the right you see bees doing different things – task parallelism, which is easy to express with MPI but currently is not the focus of OpenCL (when targeting GPUs). Actually, it is very easy to do this with OpenCL too, if the hardware supports it, such as on CPUs; see the sketch below.
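As a minimal sketch (my own illustration, assuming two kernels task_a and task_b that were built elsewhere with clCreateProgramWithSource/clBuildProgram/clCreateKernel), task parallelism in OpenCL 1.x comes down to giving each independent task its own command queue:

#include <CL/cl.h>

/* Run two independent single-work-item tasks, each on its own queue.
 * Error handling is omitted for brevity. */
void run_two_tasks(cl_context ctx, cl_device_id cpu,
                   cl_kernel task_a, cl_kernel task_b)
{
    /* One queue per task: the runtime is free to execute them
     * concurrently, which on a CPU device maps nicely onto cores. */
    cl_command_queue qa = clCreateCommandQueue(ctx, cpu, 0, NULL);
    cl_command_queue qb = clCreateCommandQueue(ctx, cpu, 0, NULL);

    size_t one = 1;
    clEnqueueNDRangeKernel(qa, task_a, 1, NULL, &one, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(qb, task_b, 1, NULL, &one, NULL, 0, NULL, NULL);

    clFinish(qa);
    clFinish(qb);
    clReleaseCommandQueue(qa);
    clReleaseCommandQueue(qb);
}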

Continue reading “MPI in terms of OpenCL”

Targeting various architectures in OpenCL and CUDA

“Everything that *is* makes up one single world; but not everything is alike in this world” – Plato

The question we aim to answer in this post is: “How do you make software that performs well on several platforms?”

Note: This article is not fully finished – I’ll add more information during the coming months. It’s busy here!

Even in a lot of Java code you’ll find hard-coded path-delimiters in file names, which then work on one OS only. Portability is a problem that exists in various aspects of programming. Let’s look at some of the main goals software can have, and which portability problems come with them.

  • Functionality. This is the minimum requirement. Once a function is decided, changing functionality takes a lot of time. Writing code that is very flexible in requirements is hard.
  • User-interface. This is what one sees and which is not too abstract to talk about. For example, porting software to a touch-device requires a lot of rethinking of interaction-principles.
  • API and library usage. To lower development time, existing and known APIs and libraries are used. This can work out in three ways: separation of concerns, less development time, and dependency. The first two are good architectural choices; the latter is a potential hazard. Changing the underlying APIs is not easy.
  • Data-types. Handling video is different from handling video-formats. If the files can be handled in the intermediate form used by the software, then adding new file-types is relatively easy.
  • OS and platform. Besides many visible specifics, an OS is also a collection of APIs. Not only do corporate operating systems tend to think of their own platform only; competing standards do too. It compares a lot to what is described under APIs.
  • Hardware-performance. Optimizing software for a specific platform makes it harder to port to other platforms. This will be the main point of this article.

OpenCL is known for not being performance-portable, but it is the best we currently have when it comes to writing code with performance as a primary target. The funny thing is that with CUDA 5.0 it has become clearer that NVIDIA has this problem in their GPGPU language too, whereas before this was used to differentiate CUDA from OpenCL. Also, CUDA 5.0 has many new features that are only available on the latest Kepler GPUs.

Continue reading “Targeting various architectures in OpenCL and CUDA”

Scaling mobile GPUs to 1000 GFLOPS

On the 20th of April 2013 there was an interesting discussion between Jan Gray and David Kanter. Jan is a specialist in C++ and FPGAs (twitter, homepage). David is a specialist in CPU and GPU architectures (twitter, homepage). Both know their way around the field of semiconductors. It is always a joy to follow their short discussions when they happen, but there was something about this one that made me want to share it with special attention.

OpenCL on ARM: Growth-expectation of GFLOPS/Watt of mobile GPUs exceeds Moore’s law. That’s incredible!

Jan Gray: .@OpenCLonARM GFLOPS/W more a factor of almost-over Dennard Scaling. But plenty of waste still to quash. http://www.fpgacpu.org/papers/Gray_AutumnOfMooresLaw_SingularityUniversity_11-06-23.pdf

Jan Gray‏: .@openclonarm Scratch Dennard tweet: reduced capacitance of yet smaller devices shd improve GFLOPS/W even as we approach end of Vdd scaling.

David Kanter: @jangray @OpenCLonARM I think some companies would argue Vdd scaling isn’t dead…

Jan Gray: @TheKanter @openclonarm it’s not dead, but slowing, we’ve gone from 5V to 1V (25x power savings) and have maybe several hundred mVs to go.

David Kanter: @jangray I reckon we have at least 400mV, so ~2X; slower than ideal, but still significant

Jan Gray: @TheKanter We agree, I think.

David Kanter: @jangray I suspect that if GPU scaling > Moore’s Law then they are just spending more area or power; like discrete GPUs in the last decade

David Kanter: @jangray also, most positive comment I’ve heard from industry folks on mobile GPU software and drivers is “catastrophically terrible”

Jan Gray: @TheKanter Many ways to reduce power, soup to nuts. For ex HMC DRAM on interposer for lower energy signaling. I’m sure many tricks to come.

In a nutshell, these are all the reasons why they think mobile GPUs can outpace Moore’s law while staying under a certain power usage.

It needs some background info, so let’s start with the background of the first tweet, and then explain what has been said. Continue reading “Scaling mobile GPUs to 1000 GFLOPS”

About Us

Stream HPC is a software development company specialising in parallel software for many-core processors. We provide professional software development services, training and consulting to help you increase compute performance in software while lowering hardware costs.

We have 3 locations.

Stream HPC B.V. (Amsterdam)

Koningin Wilhelminaplein 1 – 40601
1062 HG Amsterdam
Netherlands, Europe

phone: +31 854865760 (office) or +31 6 45400456 (cell)

Visit us in Amsterdam

Stream HPC Hungary Kft. (Budapest)

Science Park
1117 Budapest
Irinyi József u. 4-20.
Hungary, Europe

Stream HPC Spain S.L. (Barcelona)

Plaza de Catalunya 1, 4th floor
Barcelona 08002
Spain, Europe

History

2010 – 2013: the freelancing years

The company started as a freelancing business, with one focus: programming GPUs with OpenCL. It was tough, as back then the G in GPU stood for “Graphics only”.

The name was “StreamComputing”, after the concept of a high-performance computer system that analyzes multiple data streams from many sources live. The main goal was to create software algorithms that analyze data in real time as it streams in, to increase speed and accuracy in data handling and analysis – which was in line with that concept.

2014: first hope

Four years later the first employee, Anca, was hired. Later that year the freelancing business was turned into a limited company. GPUs became more accepted as data processors, and trainings were the main income. Projects were still small, GPGPU was a world of early adopters, and most time was invested in trainings.

First contact was made with AMD, now one of our biggest clients.

2015-2017: initial growth

Stream grew to a handful of employees, and we did projects for the HSA Foundation, Stanford, AMD, Zeiss, Nokia, Philips and many lesser-known companies.

Trainings were still given, but were by far not the main source of income anymore. We tried some FPGA work, but found that most of the promises had not been implemented yet.

2017: a new name

We renamed the company to Stream HPC. There were several reasons. As we focused more on customers from Asia and North America, we needed the .com domain, which was unavailable. Getting the new name was quite a quest, but we got there: customers often referred to us as “Stream”, a business coach assured us that CPU work would remain important and thus “HPC” was more important than “GPU”, and it was quite difficult to type streamcomputign correctly.

2017-2020: hitting all kinds of ceilings

The goal was to grow further, but this turned out to be more difficult than expected. All kinds of obstacles got in our way, and we even shrunk in size once. With trainings, coaching, reading and persistence, we came to understand the hurdles and could finally implement solutions. Looking back, it seems easy.

2021: Stream HPC Hungary

Hungary started as a group of freelancers. We were very happy with the quality provided by our Hungarian colleagues, and that was enough reason to invest more. We opened the new office in Q3.

The company had now turned into a group of companies, and everything was set up to extend the group more easily.

We grew back to 15 people by the end of the year.

2022: Benchmark.io

At ISC, Benchmark.io was launched. To help our customers do better benchmarks, we put all our knowledge into a separate product. Due to the high demand for our consultancy services, it is in private beta only.

2022: Stream HPC Spain

Barcelona was opened in Q3.

The estimate is that we will grow to 25-35 people by the end of the year.

Contact us

Thank you for your interest in our company and services. We will try to answer your question within 24 hours.

There are three ways to get in contact:

[Contact form: first name, last name, email, company, phone number, message]

    See ‘about us‘ for the address and other business-specific information.