Intel’s answer to AMD and NVIDIA: the XEON Phi 5110P

NOTE: there are many contradictory sources out there, so there will be mistakes in this article. Please give me feedback via Twitter, mail or the comments, so all the info can be completed.

Yes, another post in the answer-to series. At SC12 Intel tried to steal the show from the Tesla K20 and FirePro S10000.

After two years of waiting, Intel has finally come with an accelerator card: the Xeon Phi. Compare it to NVIDIA having skipped the GTX 200 series and presenting the GTX 500 series now. Or maybe even the GTX 600 series – we cannot tell yet.

The Phi is not a compute-card as we know it. Just as you cannot do a 1-to-1 comparison between AMD's GCN architecture and NVIDIA's Kepler, neither can easily be compared to the Phi. But this article should give an idea of where it is positioned.

Continue reading “Intel’s answer to AMD and NVIDIA: the XEON Phi 5110P”

The entanglement of Bitcoins and compute-capabilities

Every now and then I read stories on Bitcoins (Wikipedia article), as GPUs are used a lot to “mine” Bitcoins. The community has some extensive benchmarks, and their discussions give me insights into specific parts of accelerators like GPUs. This group is also very forward when it comes to accepting new techniques. Today something changed: they are a bank now. One of the thoughts I had about this I'd like to share with you.

If you look at various types of currencies, you see they all have various goals (trade, power, resources, energy, properties, etc.). The inequalities and differences are even more important than the amounts. Various currencies are tied to a certain goal or resource, but nothing is strongly tied to technology. Here is where Bitcoins come in…

Bitcoins are entangled with compute-power – a current benchmark for technological progress.

In this article I'd like to share how the tech-economy and Bitcoins are entangled, seen from the perspective of computing. I left out a lot of the “rules of economy” and hope you can fill these in – the text below is just there to guide you through the thought-process. Disagreement is only good – we all learn from it.
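To make “mining = compute” concrete: mining is, at its core, a brute-force hash search, so more compute-power directly means more coins. Below is a toy proof-of-work loop in Python. It mimics the idea of Bitcoin's double-SHA256 search for a small hash value; the header bytes and the difficulty are made-up toy values, not the real protocol.

```python
import hashlib

def proof_of_work(header: bytes, difficulty_bits: int = 20) -> int:
    """Find a nonce such that sha256(sha256(header + nonce)) is below a target."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        payload = header + nonce.to_bytes(8, "little")
        digest = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

# Whoever tries the most nonces per second wins the most blocks - which is
# exactly why mining hardware doubles as a benchmark for compute progress.
print(proof_of_work(b"toy block header"))
```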

Continue reading “The entanglement of Bitcoins and compute-capabilities”

Applied GPGPU-days Amsterdam 2013

December 2013: the videos are not ready yet, but the link will be put here.

Amsterdam, 20 June – the Applied GPGPU-days. Keep your agenda free for this event.

What can you do with GPUs to speed up computations? This year we will see various examples of where OpenCL and CUDA have been used. We hope to give you an answer to whether you can use GPUs for your software, research or algorithm.

After the success of last year (fully booked with 66 attendees), we have now reserved a larger location with room for 100 people. The difference from last year is that we focus more on applications and less on technical aspects.

The program has been made public recently:

Title of talk | Company/Institute | Presenter
Introduction to GPGPU and GPU-architectures | StreamHPC | Vincent Hindriksen
Blender Cycles & Tiles: Enhancing user experience | AtMind bv | Monique Dewanchand & Jeroen Bakker
XeonPhi vs K20: The fight of the titans | SURFsara | Evghenii Gaburov
A real-time simulation technique for ship-ship and ship-port interaction | PMH bv | Jo Pinkster
CUDA Accelerated Neural Networks | LIACS | Ana Balevic
Efficient Reconstruction of Biological Networks via Transitive Reduction on GPUs | TU Eindhoven | Anton Wijs
Running PETSc on GPUs with an example from fluid dynamics | SURFsara | Thomas Geenen
Connected Component Labelling, an embarrassingly sequential algorithm | Leeuwarden University | Jaap van de Loosdrecht
Visualizing sound and vibrations using a GPU and a 1024-channel microphone array | TU Eindhoven | Wouter Ouwens
Gravitational N-body simulations on 1 to many GPUs | Leiden Observatory | Jeroen Bédorf

A few demos will be shown.

For more information, see the Platform Parallel webpage, where you can also find other events by the platform.

Tickets are €75,-. If you are from a Dutch university or research institute affiliated with SURF, your ticket has been fully sponsored by SURFsara.

Associated events in the Netherlands

For the technical aspects (GPU-programming techniques, optimisation, etc.) we have a special day: the GPU Dev Day 2013. More information is on the Platform Parallel webpage. Date and place will be made public in June.

The first Khronos Meetup Benelux will take place just before the Applied GPGPU day, on 19 June in Amsterdam. More information on the meetup-page.

First Khronos Chapter meeting in Amsterdam: WebGL/OpenGL


On Thursday 13 February 2014 the first Khronos meetup in Amsterdam will take place. We expect a small group, so the location will be cozy and there will be enough time to talk over a beer. The first round is on me; admission is free.

The goal is to learn about open media-standards from Khronos and others. So when OpenCV is discussed, we'll also talk OpenVX. The target group is programmers and indie developers who are interested in creating multi-OS and multi-device software.

Program

I am very thrilled to tell you that Ton Roosendaal of the Blender Foundation will talk about the relationship between Blender and Khronos' OpenGL.

Second, Maarten and Jurjen of ThreeDee Media will talk about WebGL, from both a technical and a market view. Is WebGL ready for prime-time?

Then you can show your own stuff. For that I'll bring a good laptop with Windows 8.1 and Ubuntu 13.10 64-bit.

Prepare for the meetup today!

See the Meetup-page for more information. See you there!

Market Positioning of Graphics and Compute solutions

When compute became possible on GPUs, it was first presented as an extra feature and did not change the positioning of the products by AMD/ATI and NVIDIA much. NVIDIA started with positioning server-compute (described as “the GPU without a monitor-connector”), where AMD and Intel followed. When the expensive GeForce GTX Titan and Titan Z got introduced, it became clear that NVIDIA still thinks about positioning: the Titan is the bridge between GeForce and Tesla, a Tesla with video-out.

Why is positioning important? It is the difference between “I'd like to buy a compute card for my desktop, so I can develop algorithms that will also run on the compute server” and “I'd like to buy a graphics card for doing computations and later run that on a passively cooled graphics card”. The second version might get a “you don't want to do that” as an answer, because graphics terminology is used to refer to compute goals.

Let’s get to the overview.

Market | AMD | NVIDIA | Intel | ARM
Desktop User * | A-series APU | – | Iris / Iris Pro | –
Laptop User * | A-series APU | – | Iris / Iris Pro | –
Mobile User | – | Tegra | Iris | Mali T720 / T4xx
Desktop Gamer | Radeon | GeForce | – | –
Laptop Gamer | Radeon M | GeForce M | – | –
Mobile High-end | – | Tegra K (?) | Iris Pro | Mali T760 / T6xx
Desktop Graphics | FirePro W | Quadro | – | –
Laptop Graphics | FirePro M | Quadro M | – | –
Desktop (DP) Compute | FirePro W | Titan (HDMI) / Tesla (no video-out) | XeonPhi | –
Laptop (DP) Compute | FirePro M | Quadro M | XeonPhi | –
Server (DP) Compute | FirePro S | Tesla | XeonPhi (active cooling!) | –
Cloud | Sky | Grid | – | –

* = For people who say “I think my computer doesn’t have a GPU”.

My thought is that the Titan is there to promote compute at the desktop, while Tesla is also promoted for that. AMD has the FirePro W for that, serving both graphics professionals and compute professionals. Intel uses the XeonPhi for anything compute, and it is all actively cooled.

The table has some empty spots: NVIDIA doesn't have an IGP, AMD doesn't have a mobile GPU, and Intel doesn't have a clear message at all (J, N, X, P and K mixed for all types of markets). Mobile GPUs from ARM, Imagination, Qualcomm and others have a clear message differentiating between high-end and low-end mobile GPUs, whereas NVIDIA and Intel don't.

Positioning of the Titan Z

Even though I think NVIDIA made the right move by positioning a GPU for the serious compute hobbyist, they are very unclear with their proposition. AMD is very clear: “Want professional graphics and compute (and to play games after work)? Get the FirePro W for workstations”, whereas NVIDIA says “Want compute? Get a Titan if you want video output, or a Tesla if you don't”.

See this GeForce page, where they position it as a gamers' card that competes with the Google Brain supercomputer and a Mac Pro. In other places (especially benchmarks) it is stressed that it is not meant for gamers, but for compute enthusiasts (who can afford it). See for example this review on Hardware.info:

That said, we wouldn’t recommend this product to gamers anyway: two Nvidia GeForce GTX 780 Ti or AMD Radeon R9 290X cards offer roughly similar performance for only a fraction of the money. Only two Titan-Zs in SLI offer significantly higher performance, but the required investment is incredibly high, to the point where we wouldn’t even consider these cards for our Ultimate PC Advice.

As a result, Nvidia stresses that these cards are primarily intended for GPGPU applications in workstations. However, when looking at these benchmarks, we again fail to see a convincing image that justifies the price of these cards.

So NVIDIA's naming convention is unclear. If the TITAN is for the serious and professional compute developer, why use the brand “GeForce”? A “Quadro Titan” would have made much more sense. Or even a “Tesla Workstation”, so developers could get a guarantee that the code would run on the server too.

Differentiating from low-end compute

Radeon and GeForce GPUs are used for low-cost compute clusters. Both AMD and NVIDIA prefer to sell their professional cards for that market and have difficulty making clear that game cards are not designed for compute-only solutions. The one thing they did in the past years is reserve good double-precision performance for their professional cards only. An existing difference was the driver quality between Quadro/FirePro (industry quality) and GeForce/Radeon. I think both companies have to rethink the differentiated driver strategy, as compute has changed the demands in the market.

I expect more differences between the support-software for different types of users. When would I pay for professional cards?

  1. Double Precision GFLOPS
  2. Hardware differences (ECC, NVIDIA GPUDirect or AMD SDI-link/DirectGMA, faster buses, etc)
  3. Faster support
  4. (Free) Developer Tools
  5. System Configuration Software (click-click and compute works)
  6. Ease of porting algorithms to servers/clusters (up-scaling with fewer bugs)
  7. Ease of porting algorithms to game-cards (simulation-mode for several game-cards)

So the list starts with hardware-specific demands, then focuses on developer support. Let me know in the comments why you would (or would not) pay for professional cards.

Evolving from gamer-compute to server-compute

GPU developers are not born, but made (trained or self-educated). Most of the time they start with OpenCL (or CUDA) on their own PC or laptop.

With NVIDIA it would be hobby-compute on a GeForce, then serious stuff on a Titan, then Tesla or Grid. AMD has a comparable growth path: hobby-compute on a Radeon, then an upgrade to FirePro W and then to FirePro S or Sky. With Intel it is Iris or XeonPhi directly, as their positioning is not clear at all when it comes to accelerators.

Conclusion

The positioning of graphics cards and compute cards is finally getting finalised at the high level, but will certainly change a few more times in the year(s) to come. Think of the growing market of home-video editors in 2015, who will probably need a compute card for video compression. NVIDIA will come with another solution than AMD or Intel, as it has no desktop CPU.

Do you think it will be possible to have an AMD APU with an NVIDIA accelerator? Will people need to buy an accelerator-box in 2015 that can be attached to their laptop or tablet via network or USB, to do the rendering and other compute-intensive work (a “private compute cloud”)? Or will there always be a market for discrete GPUs? Time will tell.

Thanks for reading. I hope the table makes clear how things stand as of 2014. Suggestions are welcome.

We more than halved the FPGA development time by using OpenCL

A flying FPGA board

Over the past year we developed and fine-tuned a project setup for FPGA development that is much faster than any other method, including other high-level languages for making FPGA-based systems.

How we did it

OpenCL makes it easy to use the CPU and GPU and their tools. Our CPU and GPU developers would design the software with FPGAs in mind, after which the FPGA developer took over and finalised the project. As we have expertise in the very different phases of such a project, we could be much more effective than when sticking to traditional methods.

The bonus

It also works on CPUs and GPUs. It has to be said that the code hasn't been fully optimised for CPUs and GPUs – this can be done in a separate project. In case a decision has to be made on which hardware to use, our solution has the least risk and the most answers.

Our Unique Selling Points

For the FPGA market our USPs are clear:

  • We outperform traditional FPGA development companies in time-to-market and price.
  • We can discuss problems on the hardware level, software level and algorithm level. This contrasts with traditional FPGA houses, where there are fewer such bridges.
  • Our software also works on CPUs and GPUs for no additional charge.
  • The latencies of the resulting project are very comparable.

We’re confident we can make a difference in the FPGA market. If you want more information or want to discuss, feel free to contact us.

Computer Vision

Computing demands in computer vision are high, and real-time processing with low latency is often desirable. Computer vision can greatly benefit from parallelization: higher processing speeds can improve object recognition rates, while FPGA solutions may reduce energy demands or support the perception of lag-free processing. At StreamHPC, we have supported several customers in optimizing their software to work on a lower power budget and at a higher speed. We can support you with dedicated solutions based on GPUs or FPGAs to meet your demands.

Performance can be measured as Throughput, Latency or Processor Utilisation

Getting data from one point to another can be measured in throughput and latency.

When you ask how fast code is, we might not be able to answer that question. It depends on the data and the metric.

In this article I'll give an overview of different ways to describe speed and which metrics are used. I focus on two types of data-utilisation:

  • Transfers: data-movements through cables, interconnects, etc.
  • Processors: data-processing, with data in and data out.

Both are important when selecting the right hardware. When we help our customers select the best hardware for their software, an important part of the advice is based on these metrics.

Transfer utilisation: Throughput

How many bytes get processed per second, minute or hour? Often a metric of GB/s is used, but even MB/day is possible. Alternatively, items per second is used when relative speeds are discussed. A related word is bandwidth, which describes the theoretical maximum instead of the actual bytes being transported.

The typical type of software is a batch-process – think media-processing (audio, video, images), search-jobs and neural networks.

It could be that all answers are computed at the end of the batch process, or that results are given continuously. The throughput is the same, but the so-called latency is very different.
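A minimal Python sketch of that difference (the work function and item count are made up for illustration): both variants deliver roughly the same throughput, but the first result arrives at a very different moment.

```python
import time

def process(item):
    return item * item  # stand-in for the real work

items = range(1_000_000)

# Batch: all answers arrive at the end.
t0 = time.perf_counter()
batch_results = [process(x) for x in items]
batch_time = time.perf_counter() - t0
print(f"batch:  {len(batch_results) / batch_time:,.0f} items/s, "
      f"first result after {batch_time:.3f} s")

# Streaming: results are given continuously; the first arrives almost immediately.
t0 = time.perf_counter()
stream = (process(x) for x in items)
first_result = next(stream)
print(f"stream: first result after {time.perf_counter() - t0:.6f} s")
for _ in stream:  # draining the rest takes about as long as the batch run,
    pass          # so the total throughput stays roughly the same
```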

Transfer utilisation: Latency

What is the time between the data being offered and the result arriving? In other words: what is the reaction time? It is measured in time: often nanoseconds (ns, a billionth of a second), microseconds (μs, a millionth of a second) or milliseconds (ms, a thousandth of a second). When latency gets longer than seconds, it is still called latency, but more often it's called “processing time”.

This is important in streaming applications – think of applications in broadcasting and networking.

There are three causes of latency:

  1. Reaction time: hardware/software noticing there is a job.
  2. Transport time: it takes time to copy data, especially when we talk GBs.
  3. Processing time: computing the data takes time too.

When latency is most important, we use FPGAs (see this short presentation on OpenCL-on-FPGAs) or CPUs with embedded GPUs (where the total latency of context-switching from and to the GPU is a lot lower than with discrete GPUs).

Processor utilisation: Throughput

Given the current algorithm, how much potential is left on the given hardware?

The algorithm running on the processor is possibly the bottleneck of the system. The metric we use for this balance is “FLOPS per byte”: the less data that is needed per compute operation, the higher the chance that the algorithm is compute-limited. For example, single-precision SAXPY (y = a·x + y) does 2 FLOPs per element while moving 12 bytes (reading x and y, writing y), giving only about 0.17 FLOPS per byte – deeply bandwidth-limited. FYI: unless your algorithm is very inefficient, you should be very happy when you're compute-limited.

[Image: arithmetic intensity (FLOPS per byte) of several algorithms]

The image below shows how the algorithms above map onto the roofline model. You see that on many processors you need at least 4 FLOPS per byte to hit the frequency-wall; with less, you'll hit the bandwidth-wall.

[Image: the roofline model]

This is why HBM is so important.
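As a sketch of the model itself: attainable performance is the minimum of the compute ceiling and the arithmetic intensity times the memory bandwidth. The two ceilings below are made-up example values, not a specific processor.

```python
# Roofline sketch: attainable GFLOPS = min(compute ceiling, intensity * bandwidth).
PEAK_GFLOPS = 1000.0   # the frequency-wall (compute ceiling), example value
BANDWIDTH_GBS = 250.0  # the bandwidth-wall (memory ceiling), example value

def attainable_gflops(flops_per_byte: float) -> float:
    return min(PEAK_GFLOPS, flops_per_byte * BANDWIDTH_GBS)

for intensity in (0.17, 1.0, 4.0, 16.0):  # 0.17 = the SAXPY example above
    print(f"{intensity:5.2f} FLOPS/byte -> {attainable_gflops(intensity):7.1f} GFLOPS")

# The ridge point sits at 1000 / 250 = 4 FLOPS per byte: below it you hit the
# bandwidth-wall, above it the frequency-wall. Faster memory such as HBM raises
# the sloped part of the roof and moves the ridge point to the left.
```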

Processor utilisation: Latency

How fast can data get in and out of the processor? This sets the minimum latency that can be reached. The metric is the same as for transfers (time), but on the system level.

For FPGAs this latency can be very low (tens of nanoseconds) when the data cables are directly connected to the FPGA chip. Such FPGAs are on a board with e.g. a network port and/or a DisplayPort.

GPUs depend on how well they're connected to the CPU. As this is a subject of its own, I'll discuss it in another post.

Determining the theoretical speed of a system

A request like “make this software as fast as possible” is a lot easier (and cheaper) to fulfil than “make this software as fast as possible on hardware X”. This is because there is no single fastest piece of hardware (even though vendors would make us believe so); there is only hardware that is most optimal for a specific algorithm.

When doing code reviews, we offer free advice on which hardware is best for the target algorithm, given the budget and the required power envelope. Contact us today to access our knowledge.

An introduction to Grid-processors: Parallella, Kalray and KnuPath

We have been talking about GPUs, FPGAs and CPUs a lot, but there are more processors that can solve specific problems. This time I'd like to give you a quick introduction to grid-processors.

Grid-processors are different from GPUs. Where a multi-core GPU gets its strength from computing lots of data in parallel (SIMD, data-parallelism), a grid-processor is able to have each core do something different (MIMD, task-based parallelism). You could say that a grid-processor is a multi-core CPU where the number of cores is at least 16 and the cores are only connected to their neighbours. The difference with full-blown CPUs is that the cores are smaller (like on a GPU) and thus use less power. The companies themselves categorise their processors as DSPs (Digital Signal Processors), but the most popular DSPs only have 1 to 8 cores.
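A toy Python sketch of that distinction; the four “cores” and their jobs are invented for illustration.

```python
import numpy as np

data = np.arange(16)

# SIMD data-parallelism (GPU-style): one instruction stream, many data elements.
simd_result = data * 2

# MIMD task-parallelism (grid-processor-style): each core runs its own routine.
tasks = [np.sum, np.prod, np.min, np.max]     # four "cores", four different jobs
mimd_result = [task(data) for task in tasks]  # on a grid-processor these run concurrently

print(simd_result, mimd_result)
```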

For the context, there are several types of bus-configurations:

  • single bus: like the PCIe bus in a PC, or the i.MX6.
  • ring bus: like the XeonPhi up to Knights Corner, and the Cell processor.
  • star bus: a central communication core with the compute-cores around it.
  • full mesh bus: each core is connected to every other core.
  • grid bus: all cores are connected to their direct neighbours; messages hop from core to core (as sketched below).
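To make the hopping of the grid bus concrete, here is a toy sketch that assumes simple X-Y routing; the coordinates and routing scheme are illustrative only.

```python
# On a grid bus a message travels hop by hop, so its latency grows with the
# Manhattan distance between the two cores (assuming X-Y routing).
def hops(src, dst):
    (x0, y0), (x1, y1) = src, dst
    return abs(x1 - x0) + abs(y1 - y0)

print(hops((0, 0), (3, 2)))  # 5 hops across a corner of the grid
# On a full mesh this would always be 1 hop; on a ring, up to half the ring.
```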

Each of them has its advantages and disadvantages. Grid-processors get great performance (per Watt) with:

  • video encoding
  • signal processing
  • cryptography
  • neural networks

Continue reading “An introduction to Grid-processors: Parallella, Kalray and KnuPath”

Online Tutorials are here

Online training

We're going online with our presentations and tutorials. This makes it easier to reach more people and makes our trainings more flexible.

We’re starting with short introductory trainings, but we have bigger plans. Keep an eye on our events (shared on Twitter, LinkedIn, this blog and the newsletter) to see what the offerings are. And you’re very welcome to join!

On 4 October (new date) there will be a free two-hour OpenCL 101. Target timezones are eastern America and Europe.

Agenda Online OpenCL 101

  • Introductions (20 minutes)
    • StreamHPC
    • GPUs and parallelism
    • OpenCL
  • By example: Getting started with OpenCL (30 minutes)
  • By example: Porting a simple program to OpenCL (30 minutes)
  • Q&A in parallel (30 minutes). Ask us any question, for instance:
    • General OpenCL.
    • OpenCL on GPUs.
    • OpenCL on FPGAs.
    • What algorithms work well with GPUs, CPUs and FPGAs.
    • StreamHPC services.
  • The next steps (5 minutes).
  • Closing words (5 minutes).

Read more here…

Tutorial server

You can already test whether the tutorial server works for you by looking around in our demo room. The tutorial itself will be in another room. Use your own name and the password “ap”.


See you soon!

What is Khronos as of today?

The Khronos Group is the organization behind APIs like OpenGL, Vulkan and OpenCL. Over one hundred companies are members and decide together what next year's phone, camera, computer or media device will be capable of.

[Image: overview of Khronos members – we're at the right, near the bottom.]

We work most with OpenCL, but as you probably noticed, we also work with OpenGL, Vulkan and SPIR. Currently Khronos has the following APIs:

  • COLLADA, a file-format intended to facilitate interchange of 3D assets
  • EGL, an interface between Khronos rendering APIs such as OpenGL ES or OpenVG and the underlying native platform window system
  • glTF, a file format specification for 3D scenes and models
  • OpenCL, a cross-platform computation API.
  • OpenGL, a cross-platform computer graphics API
  • OpenGL ES, a derivative of OpenGL for use on mobile and embedded systems, such as cell phones, portable gaming devices, and more
  • OpenGL SC, a safety critical profile of OpenGL ES designed to meet the needs of the safety-critical market
  • OpenKCam, Advanced Camera Control API
  • OpenKODE, an API for providing abstracted, portable access to operating system resources such as file systems, networks and math libraries
  • OpenMAX, a layered set of three programming interfaces of various abstraction levels, providing access to multimedia functionality
  • OpenML, an API for capturing, transporting, processing, displaying, and synchronizing digital media
  • OpenSL ES, an audio API tuned for embedded systems, standardizing access to features such as 3D positional audio and MIDI playback
  • OpenVG, an API for accelerating processing of 2D vector graphics
  • OpenVX, Hardware acceleration API for Computer Vision applications and libraries
  • OpenWF, APIs for 2D graphics composition and display control
  • OpenXR, an open and royalty-free standard for virtual reality and augmented reality applications and devices
  • SPIR, an intermediate compiler target for OpenCL and Vulkan
  • StreamInput, an API for consistently handling input devices
  • Vulkan, a low-overhead computer graphics API
  • WebCL, a JavaScript binding to OpenCL within a browser
  • WebGL, a JavaScript binding to OpenGL ES within a browser on any platform supporting the OpenGL or OpenGL ES graphics standards

Too few people understand how unique the organization is: the biggest processor vendors discuss collaborations and how to move the market, while normally they're the fiercest competitors. Without Khronos it would have been a totally different world.

Improving FinanceBench for GPUs Part II – low hanging fruit

We found a finance benchmark for GPUs and wanted to show we could speed its algorithms up. Like a lot!

Following the initial work done in porting the CUDA code to HIP (see the linked article), significant progress was made in tackling the low-hanging fruit in the kernels and addressing potential structural problems outside of the kernels.

Additionally, since the last article we've been in touch with the authors of the original repository. They've even invited us to update their repository too; for now the code will be on our repository only. We also learnt that the group's lead, professor John Cavazos, passed away two years ago. We hope he would have liked that his work has been revived.

Link to the paper is here: https://dl.acm.org/doi/10.1145/2458523.2458536

Scott Grauer-Gray, William Killian, Robert Searles, and John Cavazos. 2013. Accelerating financial applications on the GPU. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). Association for Computing Machinery, New York, NY, USA, 127–136. DOI:https://doi.org/10.1145/2458523.2458536

Improving the basics

We could have chosen to rewrite the algorithms from scratch, but first we needed to understand the algorithms better. Also, with the existing GPU code we can quickly assess what the problems of the algorithm are, and see if we can get to high performance without too much effort. In this blog post we show these steps.

Continue reading “Improving FinanceBench for GPUs Part II – low hanging fruit”

Hello

Welcome to the webpage of Stream HPC. We're a company in Europe that works on solving the most difficult HPC problems, with an emphasis on scaling to GPUs and clusters. We have built up experience in speeding up software, designing performance-oriented architectures, writing maintainable low-level code, selecting the best hardware for the job, and building benchmarks. Above all, we're a customer-oriented company, as we want our clients to feel in control while we do the heavy lifting.

The company is multi-cultural and designed to be a safe space for everybody on our team – from LGBT+ to Asperger's, we focus on making our differences our strengths. As you can read in the job self-assessment, we have 4 main strengths:

  • CPU development: algorithms, low-level code, architectures for CPU-based software. This includes clusters.
  • GPU development: algorithms, low-level code, architectures for GPU-based software. This includes graphics programming.
  • Problem-solving: get from full understanding to full exploration quickly.
  • Self-managed teams: we don’t hire managers, but provide frameworks.

Our customers are all around the world, but especially in North America, Western Europe and East Asia. We have built a lot of high-performance software that runs on anything from edge computers to supercomputers. See “What we do” for examples.

Our offices are in:

  • Amsterdam
  • Budapest
  • Barcelona

If you want to know more, feel free to get in contact.

See this page for Netherlands/Belgium, Hungary or Spain.

MPI in terms of OpenCL

OpenCL is a member of a family of host-kernel programming language extensions. Others are CUDA, IMPC and DirectCompute/AMP. The family is defined by having a separate function or set of functions, referred to as the kernel, which is prepared and launched by the host to run in parallel. Added to that are deeply integrated language extensions for vectors, which give an extra dimension to parallelism.

Apart from the vectors, there is much overlap between host-kernel languages and parallel standards like MPI and OpenMP. As MPI and OpenMP have focused on how to make software parallel for years now, this could give you an image of how OpenCL (and the rest of the family) will evolve. It also answers how MPI's main concept, message-passing, could be done with OpenCL, and moreover how OpenCL could be integrated into MPI/OpenMP.

At the right you see bees doing different things, which is easy to parallelise with MPI, but currently doesn't have the focus of OpenCL (when targeting GPUs). Actually it is very easy to do this with OpenCL too, if the hardware supports it – such as CPUs.
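As a minimal sketch of that idea (assuming pyopencl and numpy are installed; the three “bee tasks” are invented for illustration), each work-item branches on its ID and does its own job:

```python
import numpy as np
import pyopencl as cl

kernel_src = """
__kernel void bees(__global float *out) {
    int gid = get_global_id(0);
    if (gid % 3 == 0)      out[gid] = 1.0f;  // "collect nectar"
    else if (gid % 3 == 1) out[gid] = 2.0f;  // "build comb"
    else                   out[gid] = 3.0f;  // "guard the hive"
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
result = np.zeros(12, dtype=np.float32)
out_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, result.nbytes)

program = cl.Program(ctx, kernel_src).build()
program.bees(queue, result.shape, None, out_buf)
cl.enqueue_copy(queue, result, out_buf)
print(result)
```

On a GPU the three branches serialise within a SIMD-group (warp/wavefront), but on a CPU device each core can truly take its own path – so the same OpenCL code expresses MPI-style “different work per worker” when the hardware supports it.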

Continue reading “MPI in terms of OpenCL”

PDFs of Monday 19 September

Already the fourth PDF-Monday. It takes quite some time, so I might keep it to 10 PDFs in the future – but till then, enjoy! Not sure which to read? Pick the first one (for the rest there is no order).

Edit: and the last one – follow me on Twitter to see the PDFs I'm reading. The reason is that hardly anyone clicked on the links to the PDFs.

I would like it if you let others know in the comments which PDF you liked a lot.

Adding Physics to Animated Characters with Oriented Particles (Matthias Müller and Nuttapong Chentanez). Discusses how to accelerate movements of pieces of cloth attached to bodies. No time to read? There are nice pictures.

John F. Peddy’s analysis on the GPU market.

Hardware/Software Co-Design. Simple Solution to the Matrix Multiplication Problem using CUDA.

CUDA Based Algorithms for Simulating Cardiac Excitation Waves in a Rabbit Ventricle. Bioinformatics.

Real-time implementation of Bayesian models for multimodal perception using CUDA.

GPU performance prediction using parametrized models (Master-thesis by Andreas Resios)

A Parallel Ray Tracing Architecture Suitable for Application-Specific Hardware and GPGPU Implementations (Alexandre S. Nery, Nadia Nedjah, Felipe M.G. Franca, Lech Jozwiak)

Rapid Geocoding of Satellite SAR Images with Refined RPC Model. An ESA-presentation by Lu Zhang, Timo Balz and Mingsheng Liao.

A Parallel Algorithm for Flight Route Planning with CUDA (Master-thesis by Seçkîn Sanci). About the travelling salesman problem and much more.

Color-based High-Speed Recognition of Prints on Extruded Materials. Product-presentation on how to OCR printed text on cables.

Supplementary File of Sequence Homology Search using Fine-Grained Cycle Sharing of Idle GPUs (Fumihiko Ino, Yuma Munekawa, and Kenichi Hagihara). They sped up the BOINC system (Folding@Home). It's a bit vague what they want to tell, but maybe you'll find it interesting.

Parallel Position Weight Matrices Algorithms (Mathieu Giraud, Jean-Stéphane Varré). Bioinformatics, DNA.

GPU-based High Performance Wave Propagation Simulation of Ischemia in Anatomically Detailed Ventricle (Lei Zhang, Changqing Gai, Kuanquan Wang, Weigang Lu, Wangmeng Zuo). Computation in medicine. Ischemia is a restriction in blood supply, generally due to factors in the blood vessels, resulting in damage or dysfunction of tissue.

Per-Face Texture Mapping for Realtime Rendering. A Siggraph2011 presentation by Disney and NVidia.

Introduction to Parallel Computing. The CUDA 101 by Victor Eijkhout of University of Texas.

Optimization on the Power Efficiency of GPU and Multicore Processing Element for SIMD Computing. Presentation on what you find out when putting the volt-meter directly on the GPU.

NUDA: Programming Graphics Processors with Extensible Languages. Presentation on NUDA to write less code for GPGPU.

Qt FRAMEWORK: An introduction to a cross platform application and user interface framework. Presentation on the Qt-platform – which has great #OpenCL-support.

Data Assimilation on future computer architectures. The problems projected for 2020.

Current Status of Standards for Augmented Reality (Christine Perey1, Timo Engelke and Carl Reed). not much to do with OpenCL, but tells an interesting purpose for it.

Parallel Computations of Vortex Core Structures in Superconductors (Master-thesis by Niclas E. Wennerdal).

Program the SAME Here and Over There: Data Parallel Programming Models and Intel Many Integrated Core Architecture. Presentation on how to program the Intel MIC.

Large-Scale Chemical Informatics on GPUs (Imran S. Haque, Vijay S. Pande). Book-chapter on the design and optimization of GPU implementations of two popular chemical similarity techniques: Gaussian shape overlay (GSO) and LINGO.

WebGL, WebCL and Beyond! A presentation by Neil Trevett of NVidia/Khronos.

Biomanycores, open-source parallel code for many-core bioinformatics (Mathieu Giraud, Stéphane Janot, Jean-Frédéric Berthelot, Charles Delte, Laetitia Jourdan , Dominique Lavenier , Hélène Touzet, Jean-Stéphane Varré). A short description on the project http://www.biomanycores.org.

Scaling mobile GPUs to 1000 GFLOPS

On 20 April 2013 there was an interesting discussion between Jan Gray and David Kanter. Jan is a specialist in C++ and FPGAs (twitter, homepage). David is a specialist in CPU and GPU architectures (twitter, homepage). Both know their way around the field of semiconductors. It is always a joy to follow their short discussions when they happen, but there was something about this one that made me want to share it with special attention.

OpenCL on ARM: Growth-expectation of GFLOPS/Watt of mobile GPUs exceeds Moore’s law. That’s incredible!

Jan Gray: .@OpenCLonARM GFLOPS/W more a factor of almost-over Dennard Scaling. But plenty of waste still to quash. http://www.fpgacpu.org/papers/Gray_AutumnOfMooresLaw_SingularityUniversity_11-06-23.pdf

Jan Gray‏: .@openclonarm Scratch Dennard tweet: reduced capacitance of yet smaller devices shd improve GFLOPS/W even as we approach end of Vdd scaling.

David Kanter: @jangray @OpenCLonARM I think some companies would argue Vdd scaling isn’t dead…

Jan Gray: @TheKanter @openclonarm it’s not dead, but slowing, we’ve gone from 5V to 1V (25x power savings) and have maybe several hundred mVs to go.

David Kanter: @jangray I reckon we have at least 400mV, so ~2X; slower than ideal, but still significant

Jan Gray: @TheKanter We agree, I think.

David Kanter: @jangray I suspect that if GPU scaling > Moore’s Law then they are just spending more area or power; like discrete GPUs in the last decade

David Kanter: @jangray also, most positive comment I’ve heard from industry folks on mobile GPU software and drivers is “catastrophically terrible”

Jan Gray: @TheKanter Many ways to reduce power, soup to nuts. For ex HMC DRAM on interposer for lower energy signaling. I’m sure many tricks to come.

In a nutshell, these are all the reasons why they think mobile GPUs can outpace Moore's law while staying under a certain power budget.
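To see where Jan's “25x power savings” comes from: dynamic switching power scales with the square of the supply voltage, so

```latex
P_{\text{dyn}} \approx \alpha \, C \, V_{dd}^{2} \, f
\qquad\Rightarrow\qquad
\frac{P(5\,\text{V})}{P(1\,\text{V})} = \left(\frac{5}{1}\right)^{2} = 25
```

The few hundred millivolts of headroom that remain buy far less, which is why both expect Vdd scaling to slow down rather than stop abruptly.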

It needs some background info, so let's start with the background of the first tweet, and then explain what has been said. Continue reading “Scaling mobile GPUs to 1000 GFLOPS”

Help write the book “Numerical Computations with GPUs”

There is an interesting book coming up: “Numerical Computations with GPUs” – a book explaining various numerical algorithms, with code in CUDA or OpenCL.

edit: At the moment there are 21 articles to be included in the book.

edit 2: book should be out in July

edit 3: Order via Springer International or Amazon US.
TOC:

  • Accelerating Numerical Dense Linear Algebra Calculations with GPUs.
  • A Guide to Implement Tridiagonal Solvers on GPUs.
  • Batch Matrix Exponentiation.
  • Efficient Batch LU and QR Decomposition on GPU.
  • A Flexible CUDA LU-Based Solver for Small, Batched Linear Systems.
  • Sparse Matrix-Vector Product.
  • Solving Ordinary Differential Equations on GPUs.
  • GPU-based integration of large numbers of independent ODE systems.
  • Finite and spectral element methods on unstructured grids for flow and wave propagation problems.
  • A GPU implementation for solving the Convection Diffusion equation using the Local Modified SOR method.
  • Pseudorandom numbers generation for Monte Carlo simulations on GPUs: Open CL approach.
  • Monte Carlo Automatic Integration with Dynamic Parallelism in CUDA.
  • GPU-Accelerated computation routines for quantum trajectories method.
  • Monte Carlo Simulation of Dynamic Systems on GPUs.
  • Fast Fourier Transform (FFT) on GPUs.
  • A Highly Efficient FFT Using Shared-Memory Multiplexing.
  • Increasing parallelism and reducing thread contentions in mapping localized N-body simulations to GPUs.

Continue reading “Help write the book “Numerical Computations with GPUs””

Guest-blog: Accelerating sequential machine vision algorithms with OpenMP and OpenCL

Guest-blogger Jaap van de Loosdrecht wants to share his thesis with you. He leads the Centre of Expertise in Computer Vision at NHL University of Applied Sciences, owns his own company, and still managed to study and write an MSc thesis. The thesis is interesting because it extensively compares OpenCL with OpenMP, especially in chapters 7 and 8.

For those who are interested, my thesis “Accelerating sequential machine vision algorithms using commodity parallel hardware” is available at www.vdlmv.nl/thesis.

Keywords: Computer Vision, Image processing, Parallel programming, Multi-core CPU, GPU, C++, OpenMP, OpenCL.

Many other related research projects have considered using one domain-specific algorithm to compare the best sequential implementation with the best parallel implementation on a specific hardware platform. This work was distinctive because it investigated how to speed up a whole library by parallelizing the algorithms in an economical way and executing them on multiple platforms. This work has:

  • Examined, compared and evaluated 22 programming languages and environments for parallel computing on multi-core CPUs and GPUs.
  • Chosen to use OpenMP as the standard for multi-core CPU programming and OpenCL for GPU programming.
  • Re-implemented a number of standard and well-known algorithms in Computer Vision using both standards.
  • Tested the performance of the implemented parallel algorithms and compared the performance to the sequential implementations of the commercially available software package VisionLab.
  • Evaluated the test results with a view to assessing:
    • Appropriateness of multi-core CPU and GPU architectures in Computer Vision.
    • Benefits and costs of parallel approaches to implementation of Computer Vision algorithms.

Using OpenMP it was demonstrated that many algorithms of a library could be parallelized in an economical way and that adequate speedups were achieved on two multi-core CPU platforms. With a considerable amount of extra effort, OpenCL was used to achieve much higher speedups for specific algorithms on dedicated GPUs.

At the end of the project, the choice of standards was re-evaluated including newly emerged ones. Recommendations are given for using standards in the future, and for future research and development.

Algorithmic improvements are suggested for Convolution and Connect Component Labelling.

Your feedback and/or questions are welcome.

If you put comments here, I'll make sure Jaap van de Loosdrecht gets to see them and answers your questions on the subjects discussed in his thesis.

The knowns and unknowns of the PEZY-SC accelerator at RIKEN

[Image: the PEZY-SC quad processor board]

The Green500 is out and one unknown processor takes the number one position with a huge improvement over last year. It is a new supercomputer installed at RIKEN with an incredible 7 GFLOPS/Watt. It is powered by the processor boards shown in the image: two Xeons, 4 PEZY-SC 1.4 accelerators and 128GB DRAM, which have a combined performance of about 6.2 TFLOPS. It has been designed for immersion cooling.

The second and third positions are also powered by the PEZY-SC, before we find last year's winner, the AMD FirePro S9150, and a bit after that the rest (mostly NVIDIA Tesla). One constant is the CPU used: Intel Xeon takes most spots. To my big surprise, no ARM64 anywhere.

[Image: the top 5 of the Green500, June 2015]

From the third to the first PEZY-SC installation there is an improvement of 13%. It seems the first two are of the new type, called “bricks”, while the third is the same as last year. Compared with that super from last year (4.4945 GFLOPS/W) there are improvements of 42% and 25%. The 13% improvement over the previous version is interesting enough, but the 25% improvement on exactly the same system raised questions. Probably it is due to compiler optimisations. As the November edition of the Green500 is much stricter, it will become clear whether the rules were bent – let's hope it's for real!

It supports OpenCL!

When new accelerators support OpenCL, they get accepted more easily. So it is very interesting that the PEZY-SC runs OpenCL. I asked at ISC and got the explanation that it was a subset of OpenCL, but I could not put my finger on which subset, nor could I get access to test it. It does mean that code that would run well on this machine is easy to port. And I mean the same “easy” Intel uses when explaining the ease of porting OpenMP software to the XeonPhi: PEZY-specific optimisations and writing around the missing functionality would still take effort – the typical stuff we do at StreamHPC.

RIKEN Shoubu

Some information on “Shoubu” (“Iris” in Japanese), the number 1 on the Green500. According to the Green500 it does 353.8 TFLOPS (based on 50 kW and an actual benchmark – indeed, 353.8 / 50 ≈ 7 GFLOPS/W). On 25 June RIKEN announced that the Shoubu is 2 PFLOPS (theoretical). If the full machine was used for the Green500, then the efficiency was only 18% (353.8 / 2000)!

Below are some images of the installation.

[Images: the Shoubu installation]

Source: http://www.exascaler.co.jp/wp-content/uploads/2015/06/20150625.pdf

An important part is Exascaler's immersion-cooling technology; as I understood it, Exascaler is a spin-off of PEZY. I'm very curious what the AMD FirePro S9150 does when it uses immersion cooling – I think we have to do some frying at the office to find out.

PEZY-SC1.4 and PEZY-SC2

PEZY started with a multi-core processor of 512 cores, the PEZY-1. The PEZY-SC has 1,024 cores and has had a few gradual upgrades – currently the PEZY-SC 1.4 (“the brick”) is installed.

PEZY-SC Specification:

Logic Cores (PE) | 1,024
Core Frequency | 733 MHz
Peak Performance (Floating Point) | Single 3.0 TFLOPS / Double 1.5 TFLOPS
Host Interface | PCI Express Gen3.0 x8 lane x 4 ports (x16 bifurcation available); JESD204B protocol support
DRAM Interface | DDR4/DDR3 combo 64-bit x 8 ports, max B/W 1,533.6 GB/s; plus Ultra Wide IO SDRAM (2,048-bit) x 2 ports, max B/W 102.4 GB/s
Control CPU | ARM926 dual core
Process Node | 28 nm
Package | FCBGA 47.5 mm x 47.5 mm, ball pitch 1 mm, 2,112 pins

Source: http://pezy.co.jp/en/products/pezy-sc.html

Development of the PEZY-SC2 is ongoing; it will have a staggering 4,096 cores. Of course the efficiency has to go up (if the 18% is correct) to make it a good upgrade.

There is no promise on when the PEZY-SC2 will be announced, but it will certainly surprise us again when it arrives.

Call for speakers: IEEE eScience Conference in Amsterdam

We're on the program committee of the 14th IEEE eScience Conference in Amsterdam, organized by the Netherlands eScience Center. It will be held from 29 October to 1 November 2018, and the deadline for sending in abstracts is Monday 18 June.

The conference brings together leading international researchers and research software engineers from all disciplines to present and discuss how digital technology impacts scientific practice. eScience promotes innovation in collaborative, computationally- or data-intensive research across all disciplines, throughout the research lifecycle.

Continue reading “Call for speakers: IEEE eScience Conference in Amsterdam”

About Us

Stream HPC is a software development company specialised in parallel software for many-core processors. We provide professional software development services, training and consulting to help you increase the compute performance of your software while lowering hardware costs.

We have 3 locations.

Stream HPC B.V. (Amsterdam)

Koningin Wilhelminaplein 1 – 40601
1062 HG Amsterdam
Netherlands, Europe

phone: +31 854865760 (office) or +31 6 45400456 (cell)

Visit us in Amsterdam

Stream HPC Hungary Kft. (Budapest)

Science Park
1117 Budapest
Irinyi József u. 4-20.
Hungary, Europe

Stream HPC Spain S.L. (Barcelona)

Plaza de Catalunya 1, 4th floor
Barcelona 08002
Spain, Europe

History

2010 – 2013: the freelancing years

The company started as a freelancing business with one focus: programming GPUs with OpenCL. It was tough, as back then the G in GPU stood for “Graphics only”.

The name “StreamComputing” referred to a high-performance computer system that analyzes multiple data streams from many sources live. The main goal was to create software algorithms that analyze data in real time as it streams in, increasing speed and accuracy when dealing with data handling and analysis – which was in line with that name.

2014: first hope

Four years later the first employee, Anca, was hired. Later that year the freelancing business was turned into a limited company. GPUs became more widely seen as data-processors, and trainings were the main income. Projects were still small, GPGPU was a world of early adopters, and most time was invested in trainings.

First contact was made with AMD, now one of our biggest clients.

2015-2017: initial growth

Stream grew to a handful of employees, and we did projects for HSA foundation, Stanford, AMD, Zeiss, Nokia, Philips and many lesser known companies.

Trainings were still given, but were by far not the main source of income anymore. We tried some FPGA work, but found that most promises had not been implemented yet.

2017: a new name

We renamed the company to Stream HPC. There were several reasons. As we focused more on customers from Asia and North America, we needed the .com domain, which was unavailable. Getting the new name was quite a quest, but we got there: customers often referred to us as “Stream”, a business coach assured us that CPU work would remain important and thus “HPC” mattered more than “GPU”, and it was quite difficult to type streamcomputign correctly.

2017-2020: hitting all kinds of ceilings

The goal was to grow further, but this turned out to be more difficult than expected. All kinds of obstacles got in our way, and we even shrank in size once. With trainings, coaching, reading and persistence, we got to understand the hurdles and could finally implement solutions. Looking back, it was easy.

2021: Stream HPC Hungary

Hungary started as a group of freelancers. We were very happy with the quality provided by our Hungarian colleagues, and that was enough reason to invest more. We opened the new office in Q3.

The company now turned into a group of companies, and all was set up to extend the group more easily.

We grew back to 15 people by the end of the year.

2022: Benchmark.io

At ISC, Benchmark.io was started. To help our customers do better benchmarks, we put all our knowledge into a separate product. Due to the high demand for our consultancy services, it is in private beta only.

2022: Stream HPC Spain

Barcelona was opened in Q3.

The estimation is to grow to 25-35 people by the end of the year.