For Developers

Self-study material

We could keep everything to ourselves, but we like to share resources. It will take some time to learn it all, but you can always take our course for more experienced programmers.

[list1]

[/list1]

Please let us know if something is missing, so we can complete the lists of books and tutorials.

[infobox type=”information”][widgets_on_pages id=Trainings][/infobox]

OpenCL feedback and bugs

The developers who started in 2009/2010 certainly remember how buggy the first drivers were. As OpenCL is a large project and is not in the hands of a single hardware manufacturer, it can be difficult to get driver errors across. Luckily, Khronos provides two ways to give feedback: the OpenCL forums and the “Khronos Public Bugzilla”.

I have a request for the next version

The OpenCL forums are the right place for you. You can also discuss possible bugs here, if you are not sure and want others to test your code.

I found a bug!

Go to the Khronos Public Bugzilla and log in (using the email address from your Khronos account, or create a new account). If you found a bug in a driver, file it under “conformance tests”, as shown below.

[Screenshot: example OpenCL driver bug report]

It is best to mention your bug report on the forums and on Twitter, so others can take a look at it. If nobody seems to react to it, send us a message and we’ll put some pressure where needed.

SUN jumping on the OpenCL-train?

Edit (27 May 2010): so far Oracle/SUN has not shown anything that would confirm this rumour, and the job posting is not there anymore. Follow us on Twitter to be the first to know if Oracle/SUN will get better support for GPGPU for Solaris and/or Java and/or its hardware.

Job Description: The Power to change your world begins with your work at Sun!
This is a software staff engineering position requiring the ability to design, test, implement and maintain innovative and advanced graphics software. The person in this role is expected to identify areas for improvement and modification of Sun’s platform products and contribute to Sun’s overall product strategy. This person will work closely with others within the team and, as required, across teams to accomplish project objectives. May assume a leadership role in projects, including such activities as leading projects, participator in product planning and technology evaluation and related activities. May use technical leadership and influence to negotiate product design features or applications, both internally, and with open source groups as needed.
Requirements:
  * Excellent problem solving, critical thinking, and communication skills
  * Excellent knowledge of the C/C++ and Java programming languages
  * Thorough working knowledge of 3D graphics, GPU architecture, and 3D APIs, such as OpenGL & Direct3D
  * Thorough working knowledge of shader-level languages such as GLSL, HLSL, and/or Cg
  * Experience designing cross-platform, public APIs for developers (Windows/MacOS/Linux)
  * Experience with multi-threaded programming and debugging techniques
  * Experience with operating systems level engineering
  * Experience with performance profiling, analysis, and optimization
Education and Experience: University degree in computer science or engineering plus 5 years direct experience

In other words: a specialist in everything around graphics cards plus Java, in a completely new area. Since OpenGL is an already-known area that does not need such a specialist, there is a very good chance the position targets OpenCL. A good choice.

Sun has every reason to jump on the train with Java, since Microsoft is already integrating loads of OpenCL tools into its Visual Studio product (created by AMD and nVidia). Java still has more than three times the market share of C#, but with this late jump the gap will close in Microsoft’s favour. Remember that C# can easily call C functions (the language OpenCL is written in), whereas Java has a far more difficult task when it comes to calling C functions without hazards, which is sort of implemented here and here. If OpenCL does not become available to Java, then C, C++ and C# will punch a hole in Java’s share.

Besides Java, Sun’s super-multi-threaded Sparc servers will also be in trouble, since the graphics cards of nVidia and AMD are now serious competitors. There is no official OpenCL support for Sparc processors, while AMD included X86 support and IBM added PowerPC support (also working on Cell) a few months ago.

Then we have the databases Oracle and MySQL; Oracle depends on Java a lot. While we see experiments speeding up competitor PostgreSQL with GPU power, Oracle might become the slow turtle in a GPU-ruled database world. MySQL has the same development speed and also “bleeding edge” releases, but Oracle might slow down its official support. Expect Microsoft to have SQL Server fully loaded in its next major release.

If Oracle/Sun jumped on the train today, expect no OpenCL products from Oracle/Sun before Q2 2011.

Strengthen our team as a remote worker (freelancer)

In the past year we’ve been working on more internal projects, and therefore we’re seeking strong GPU coders (good OpenCL experience required) worldwide. This way you can combine staying close to your family with working on advanced technologies. You will be part of the newly formed international team.

Do understand that we have extra requirements for freelancers:

  • You have the personality to work independently.
  • You have your own computer with an OpenCL-capable GPU.
  • You have good internet (for doing remote access).

We offer a job in a well-known OpenCL-company with various interesting projects. You can improve your OpenCL skills and work with various hardware (modern GPUs, embedded processors, FPGAs and more).

Our hiring-procedure is as follows:

  • You send a CV and tell us why you are the perfect candidate.
  • After that you are invited to a longer online test, in which you show your skills in C/C++ and algorithms. You will receive a PDF with useful feedback. (3 hours)
  • We send you a GPU assignment. You need to pick the right optimisations, code them and explain your decisions in detail. (Hopefully under 30 minutes)
  • If all goes well, you’ll have a videochat on personal and practical matters. You can also ask us anything, to find out if we fit you. (Around 1 hour)
  • If you and the company are a fit, then you’ll go to the technical round. (About 3 hours)
  • Made it to here? Expect a job-offer.

We’re looking forward to your application.

Apply for a job as OpenCL expert (freelancer) now!

NVIDIA beta-support for OpenCL 2.0 works on Linux too

In the release notes for 378.66 graphics drivers for Windows (February 2017), NVIDIA officially spoke about supporting OpenCL 2.0 for the first time. Unfortunately, this is partial support only and, as NVIDIA said, these new [OpenCL 2.0] features are available for evaluation purposes only.

We did our own tests on a GTX 1080 on Windows and could confirm that on Windows the green team is halfway there. NVIDIA still has to implement pipes, enable non-uniform work-group sizes (these occur when the global_work_size of an ND-range is not divisible by the local_work_size) and fix a few bugs in device-side enqueue.
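For clarity, here is a minimal sketch (our own, not from the driver notes) of where non-uniform work-group sizes appear; the queue and a kernel built with -cl-std=CL2.0 are assumed to already exist:

    // With OpenCL 2.0, the last work-group in a dimension may be smaller
    // than the local size. Here 1000 % 256 != 0, so the final group has
    // 232 work-items. A driver without non-uniform work-group support
    // rejects this call with CL_INVALID_WORK_GROUP_SIZE.
    size_t global_work_size = 1000;
    size_t local_work_size  = 256;
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                        &global_work_size, &local_work_size,
                                        0, NULL, NULL);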

Today we decided to test out NVIDIA’s latest driver (378.13) for 64-bit Linux and check its support for OpenCL 2.0.

NVIDIA, OpenCL 2.0 and Linux

Just like on Windows, our GTX 1080 reports that it is an OpenCL 1.2 device. That is understandable, since support for OpenCL 2.0 is only in beta stage. In the following table you’ll find an overview of the 2.0 features supported by this Linux driver.

| OpenCL 2.0 feature | Supported | Notes |
|---|---|---|
| SVM | Yes | Only coarse-grained SVM is supported. Fine-grained SVM (optional feature) is not. |
| Device-side enqueue | Partially. Surprisingly, it works better than on Windows | Almost all OpenCL programs with a device-side queue that we have tested work. Some advanced examples with multi-level device-side kernel enqueuing and/or CLK_ENQUEUE_FLAGS_WAIT_WORK_GROUP fail. When using the device-side queue, it is only possible to use a 1D ND-range with uniform work-groups (or without specifying a local size); 2D and 3D ND-ranges don't work. |
| Work-group functions | Yes | |
| Pipes | No | Pipe functions are defined in libOpenCL.so in the 378.13 drivers, but using them causes run-time errors. |
| Generic address space | Yes | |
| Non-uniform work-groups | No | |
| C11 atomics | Partially | Using atomic_flag_* functions causes a CL_BUILD_ERROR. |
| Subgroups extension | No | |
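To illustrate what does work, here is a minimal sketch of our own (not taken from NVIDIA's samples) of device-side enqueue in OpenCL C 2.0, using the 1D pattern without an explicit local size that this driver accepts:

    // A parent kernel enqueues further work on the device-side default
    // queue, without a round-trip to the host (OpenCL C 2.0 Blocks syntax).
    kernel void parent(global int* data, int n) {
        if (get_global_id(0) == 0) {
            queue_t q = get_default_queue();
            // 1D ND-range without a local size: one of the few variants
            // this driver currently handles.
            enqueue_kernel(q, CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                           ndrange_1D(n),
                           ^{ data[get_global_id(0)] *= 2; });
        }
    }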

The host-side functions clSetKernelExecInfo(), clCreateSamplerWithProperties() and clCreateCommandQueueWithProperties() are also present and working.
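These belong together: clCreateCommandQueueWithProperties() is what creates the on-device queue that device-side enqueue relies on. A sketch, assuming ctx and device already exist:

    // An on-device queue must be out-of-order; marking it as the default
    // device queue makes it the one get_default_queue() returns in kernels.
    cl_queue_properties props[] = {
        CL_QUEUE_PROPERTIES, CL_QUEUE_ON_DEVICE |
                             CL_QUEUE_ON_DEVICE_DEFAULT |
                             CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
        CL_QUEUE_SIZE, 16 * 1024,   // bytes reserved for enqueued commands
        0
    };
    cl_int err;
    cl_command_queue device_queue =
        clCreateCommandQueueWithProperties(ctx, device, props, &err);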

As you can see, the support for OpenCL 2.0 on Linux is almost exactly the same as on Windows. But in contrast with the Windows drivers, we were able to successfully compile and run several more kernels that use the device-side queue. This may indicate that the feature is being actively developed, and that in future drivers it will work much better – for both Linux and Windows.

What you can do to make it better

As NVIDIA only adds new functionality to its OpenCL driver when requested, it is very important that they receive these requests. So if you or your employer is a paying customer, do keep requesting the features you need. NVIDIA knows that lacking required functionality is bad for their sales.

Making the release version of prototype code

We transform your prototype into a performant product

Prototypes are very useful to prove that concepts work and to test out various variations in a short time. Often Matlab, Python, Excel or Labview code is used, or libraries like OpenCV. The main problems with that code are often performance and portability. We solve that. Our software development projects are very much like the software projects you are used to; the main difference is that the code ends up much faster than if somebody else had written it.

An overview of our advantages:

  • As we only have senior programmers, the code quality is higher.
  • We are very much in control and can therefore have more variety:
    • Our software works seamlessly together with your existing code, so that code doesn’t need a rewrite.
    • Our code can work on various hardware, like CPU, GPU and FPGA.
    • Our code works on Windows, Linux and OSX.
  • You will get the fastest software on the market. We have often delivered speed-ups above expectation.


Neil Trevett on OpenCL

The Khronos Group gave some talks on their technologies in Shanghai, China, on the 17th of March 2012. Neil Trevett made some interesting remarks on the position of NVidia on OpenCL that I would like to share with you. Neil Trevett is both an important member of Khronos and an employee of NVidia; to be more precise, he is Vice President Mobile Content at NVidia and the president of Khronos. I think we can take his comments seriously, but we must be very careful, as they are mixed with his personal opinions.

Regular readers of the blog have seen that I am not enthusiastic at all about NVidia’s marketing, but that I am a big fan of their hardware. And I am very positive they are bold enough to position themselves very well in the fast-changing markets of the upcoming years. Having said that, let’s go to the quotes.

All quotes are from this video. Best is to watch from 41:50 till 45:35.

http://www.youtube.com/watch?v=_l4QemeMSwQ

At 44:05 he states: “In the mobile space I think CUDA is unlikely to be widely adopted“, and explains: “A party API in the mobile industry doesn’t really meet market needs“. Then he continues with his vision on OpenCL: “I think OpenCL in the mobile is going to be fundamental to bring parallel computation to mobile devices” and then “and into the web through WebCL“.

Also interesting, at 44:55: “In the end NVidia doesn’t really mind which API is used, CUDA or OpenCL. As long as you get to use great GPUs“. He ends with a smile, as “great GPUs” refers to NVidia’s, of course. 🙂

At 45:10 he lays out NVidia’s plans for HPC, before getting back to mobile: “NVidia is going to support both [CUDA and OpenCL] in HPC. In Mobile it’s going to be all OpenCL“.

At 45:23 he repeats his statements: “In the mobile space I expect OpenCL to be the primary tool“.

Continue reading “Neil Trevett on OpenCL”

Disruptive Technologies

Steve Streeting tweeted a few weeks ago: “Remember, experts are always wrong about disruptive tech, because it disrupts what they’re experts in.” I’m happy that I evangelise and work with such a disruptive technology, and it will take time until it is bypassed by other technologies. Those other technologies will probably be source-to-OpenCL-source compilers. At StreamHPC we therefore continuously keep track of all these pre-compilers.

Steve’s tweet triggered me, since the stability-vs-progression balance makes changes quite hard (we see it all around us). Another reason was a statement heard during the opening speech of Engineering World 2011 about “the cloud”, which went something like: “80% of today’s IT will be replaced by standardised cloud solutions”. Most probably true; today any manager could and should click together his or her “data from A to B” report instead of buying an “oh, that’s very specialised and difficult” solution. But on the other side, companies try to keep their existing business alive as long as possible. It’s therefore an intriguing balance.

So I came up with the idea of playing my own devil’s advocate and trying to disrupt GPGPU. I think it’s important to see what could disrupt the current parallel-kernel-execution model of OpenCL, CUDA and the others.

Continue reading “Disruptive Technologies”

Sheets GPGPU-day 2012 online

Photos made by Cyrille Favreau

Better late than never: it has almost been a year, but the sheets of the GPGPU-day Amsterdam 2012 are finally online.

You can find the sheets at http://www.platformparallel.com/nl/gpgpu-day-2012/abstracts/ – don’t hotlink the files, but link to this page. The abstracts should introduce the sheets, but if you need more info, just ask in the comments here.

PDFs from two unmentioned talks:

I hope you enjoy the sheets. On 20 June the second edition will take place – see you there!


Qualcomm Snapdragon 600 & 800 (Adreno 320 & 330)

snapdragon-800-mdps

[infobox type=”information”]

Need a Snapdragon programmer? Hire us!

[/infobox]

There are two Adreno GPUs currently known to have or get OpenCL support: the 320 and the 330, in the Snapdragon 600 and Snapdragon 800 respectively.

Qualcomm does not provide a developer’s board, but the Sony Xperia Z is known to have OpenCL. Other phones are expected to have drivers pre-installed too. That is interesting, as new phones with the Adreno 330 will ship soon, such as the LG Optimus G2 LS980, the Sony Xperia Z Ultra and a version of the Samsung Galaxy S4.

Drivers are still in beta and are known to have bugs (as of April 2013). This discussion is the most interesting one to follow if you want to keep up to date.

There are plenty of tools available, such as the Snapdragon SDK for Android and these Tools and Resources for the Adreno GPU. In the latter you’ll find OpenCL samples you can run too (it is a Windows installer, for some vague reason, so Mac and Linux users need to do some extracting). You can start building from the code in this project.

http://www.youtube.com/watch?feature=player_embedded&v=CaS0kpozyMM

Boards

The focus is on the more recent Snapdragon 800.

Inforce IFC6410 – Snapdragon 600

The IFC6410 is a $149 single-board computer with an Adreno 320 and the Qualcomm Snapdragon S4 Pro (APQ8064).

Datasheet (PDF)

Order here.


Bsquare Mobile Development Boards for Snapdragon 800

Processor: quad-core Krait 400 CPU at up to 2.3GHz per core (Snapdragon 8974), Adreno 330 GPU and Hexagon QDSP6 V5. A few highlights: wifi n/ac, Bluetooth 4, USB 3.0, NFC, a 1280x720p screen (tablet: 1920x1080p), 2GB 800MHz memory and a 12MP+2MP camera. It all runs Android 4.2 (Jelly Bean), so no Linaro packages. More info on the Qualcomm MDB page and on this Qualcomm blog.

Phone form factor: $799 – tablet: $1099. Also check out Bsquare’s information page for these products, but be aware that some links point to the wrong PDFs.

Warning: you cannot call or use your provider’s internet with these devices! The word ‘phone’ only refers to the form factor.

DragonBoard Snapdragon APQ8060A for Snapdragon 800

Some highlights: Snapdragon 8074 quad-core processor, 2GB of LPDDR3 RAM, 16GB of eMMC, Wi-Fi, Bluetooth, GPS, HDMI out and a qHD LCD with capacitive multi-touch, plus the Adreno 330.

Can be ordered via http://mydragonboard.org/db8074/ for $499,-


Sony Xperia Z phones

The Xperia Z1 and Xperia Z Ultra have OpenCL support, with drivers ready-loaded. Go here for an introduction to OpenCL on these phones.

You need the Android NDK to build and run the OpenCL programs.
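A minimal sketch of how such an NDK program typically locates the driver – assuming, as on these phones, that the vendor ships libOpenCL.so on the device:

    // Load the vendor's OpenCL driver at run-time and check that a core
    // entry point is present. Apps do it this way because Android itself
    // ships no OpenCL stub library to link against.
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void) {
        void* lib = dlopen("libOpenCL.so", RTLD_NOW);
        if (!lib) {
            printf("No OpenCL driver found: %s\n", dlerror());
            return 1;
        }
        void* fn = dlsym(lib, "clGetPlatformIDs");
        printf("clGetPlatformIDs %s\n", fn ? "found" : "missing");
        dlclose(lib);
        return 0;
    }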

Sony sees great advantages in using OpenCL on their mobile phones – from the website:

You can also see that the execution speed is much faster using OpenCL on the GPU when compared to the plain single threaded c-code running on the CPU (tested on Sony Xperia Z1). In addition to the speed benefit, you may also find that you decrease energy consumption by utilizing OpenCL on the GPU compared to using standard programming methods on the CPU.

ARM Mali-T604 GPU has 3.5x more performance than dual core Cortex-A15

According to the latest newsletter of the Mont-Blanc Project, the GPU on a Samsung Exynos 5 is much faster and greener than its CPU: 3.5 times faster at half the energy. They built a supercomputer using 810 Exynos SoCs that can deliver 26 TFLOPS of peak performance. With upcoming mobile GPUs becoming exponentially faster, they have all the expertise to build an even faster and greener ARM supercomputer after this one.

The Mont-Blanc compute cards deliver considerably higher performance at 50% lower energy consumption, compared with previous ARM-based developer platforms.

The Mont-Blanc prototype is based on the Samsung Exynos 5 Dual SoC, which integrates a dual-core ARM Cortex-A15 and an on-chip ARM Mali-T604 GPU, and has been featured and market proven in advanced mobile devices. The dual-core ARM Cortex-A15 delivers twice the performance of the quad-core ARM Cortex-A9, used in the previous generation of ARM-based prototype, whilst consuming 20% less energy for the same workload. Furthermore, the on-chip ARM Mali-T604 GPU provides 3.5 times higher performance than the dual-core Cortex-A15, whilst consuming half the energy for the same workload.

Each Mont-Blanc compute card integrates one Samsung Exynos 5 Dual SoC, 4 GB of DDR3-1600 DRAM, a microSD slot for local storage and a 1 GbE NIC, all in an 85x56mm card (3.3×2.2 inches). A single Mont-Blanc blade integrates fifteen Mont-Blanc compute cards and a 1 GbE crossbar switch, which is connected to the rest of the system via two 10 GbE links. Nine Mont-Blanc blades fit into a standard BullX 9-blade INCA chassis. A complete Mont-Blanc rack hosts up to six such chassis, providing a total of 1620 ARM Cortex-A15 cores and 810 on-chip ARM Mali-T604 GPU accelerators, delivering 26 TFLOPS of peak performance.
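The configuration arithmetic checks out: 15 cards per blade × 9 blades per chassis × 6 chassis gives 810 SoCs, each carrying two Cortex-A15 cores and one Mali-T604, hence the quoted 1620 cores and 810 GPUs – or roughly 32 GFLOPS of peak per SoC for the 26 TFLOPS total.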

“We are only scratching the surface of the Mont-Blanc potential”, says Alex Ramirez, coordinator of the Mont-Blanc project. “There is still room for improvement in our OpenCL algorithms, and for optimizations, such as executing on both the CPU and GPU simultaneously, or overlapping MPI communication with computation.”

Continue reading “ARM Mali-T604 GPU has 3.5x more performance than dual core Cortex-A15”

InsideHPC: SuperComputing. Where to from here?

In this video, moderator Bob Feldman hosts a session entitled “Supercomputing: Where to from Here?”, recorded at the National HPCC Conference 2011 in Newport.

Panelists:
Dr. Eng Lim Goh, SGI
Bill Feiereisen, Intel
Shumel Shottan, BlueARC
Steve Lyness, Appro International, Inc.
Marc Hamilton, HP Americas

http://www.youtube.com/watch?v=wI957eRr1kM

Below is a summary of what is said. These are just my notes, so go to the times mentioned to listen to the exact answers. Some details that you might find important I did not write down (or missed, as English is not my mother tongue).

Continue reading “InsideHPC: SuperComputing. Where to from here?”

ZiiLabs Tablet

[infobox type=”information”]

Need a ZiiLabs ZMS-40 programmer? Hire us!

[/infobox]

Intel has bought ZiiLabs, but you can still order the ZMS-40.

ZiiLabs has an early access program for OpenCL on their StemCell processor, the 100-Core ZMS-40. It could do more than 20 GFLOPS/Watt, but no official numbers have been released.

It consists of:

  • ZMS-40 powered tablet
  • OpenCL compiler (no information on whether it is a cross-compiler or native)
  • Code samples

Read more about their program at http://www.ziilabs.com/products/software/opencl.php. Also check the information on the ZMS-40 to see what the processor is capable of. Here are a few characteristics:

  • Quad 1.5 GHz ARM Cortex-A9 MP Cores
  • 96x fully-programmable StemCell Media Processing cores
  • 58 GFlops StemCell compute power

Differences from OpenCL 1.1 to 1.2

This article will be of interest if you don’t want to read the whole new specification [PDF] for OpenCL 1.2.

As always, feedback will be much appreciated.

After many meetings of the many members of the OpenCL task force, a lot of ideas sprouted. Every 17 or 18 months a new version of OpenCL comes out to give form to all these ideas. You can see totally new ideas come up, some already brought out in another member’s product. You can also see ideas not appearing at all, because other members voted against them. The last category is very interesting, and hopefully we’ll soon see a lot of forum discussion about what should be in the next version, as that is missing now.

With the release of 1.2 it was also announced that (at least) two task forces will be set up. One of them will target integration in high-level programming languages, which tells me that phase 1 of creating the standard is complete and we can expect to move towards OpenCL 2.0. I will discuss these phases in a follow-up, including what you as a user, programmer or customer can expect… and how you can act on it.

Another big announcement was that Altera is starting to support OpenCL for an FPGA product. In another article I will let you know everything there is to know. For now, let’s concentrate on the actual software-side differences in this version, and what you can do with them. I have added links to the 1.1 and 1.2 man-pages, so you can look things up.
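As a taste of the software-side changes, here is a simplified sketch (ours, without error handling) of one 1.2 addition: separate compilation and linking of program objects, something 1.1’s monolithic clBuildProgram() could not do:

    // New in OpenCL 1.2: compile a program object first, then link it
    // (possibly together with other objects or libraries) into an
    // executable program.
    #include <CL/cl.h>

    cl_program build_in_two_steps(cl_context ctx, cl_device_id dev,
                                  const char* src) {
        cl_int err;
        cl_program obj = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clCompileProgram(obj, 1, &dev, "", 0, NULL, NULL, NULL, NULL);
        return clLinkProgram(ctx, 1, &dev, "", 1, &obj, NULL, NULL, &err);
    }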

Continue reading “Differences from OpenCL 1.1 to 1.2”

Keep The Hardware Focus

The real Apu

If you buy a car, the kind of fuel is not often the first choice. You first select on the engine properties, the looks, the interior, the brand and, for sure, the total cost of ownership. The costs can be a reason to choose a certain type of fuel, though. In the parallel-computation world it is different: there the fuel (CUDA or OpenCL) is the first decision, and only then is the hardware chosen. I think this is wrong, and that is why I talk a lot about CUDA-vs-OpenCL, even though I think NVidia is a good choice for a whole list of algorithms.

When we give advice during a consult, we want to give the best advice. In the case of CUDA, that advice would be based on budget: go for a Tesla or the latest GTX; in the case of OpenCL we can give much better advice on hardware. Actually, starting with the technique is the worst thing you can do: focus on the hardware first, and then pick the technique that suits it best.

IMPORTANT: the following is for understanding some concepts and limits only! It is purely theoretical, so I don’t claim any real-world results. Also not taken into account is how well different processors handle control instructions (for, while, if, case, etc.), which has quite some influence on actual performance.

Continue reading “Keep The Hardware Focus”

Improving FinanceBench

If you’re into computational finance, you might have heard of FinanceBench.

It’s a benchmark developed at the University of Delaware, aimed at those who work with financial code, to see how certain code paths can be targeted for accelerators. It utilizes the original QuantLib software framework and samples to port four existing applications for quantitative finance. It contains codes for the Black-Scholes, Monte-Carlo, Bonds and Repo financial applications, which can be run on the CPU and GPU.
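For context, the Black-Scholes application boils down to evaluating a closed-form price per option; a plain-C sketch of the European call price (the accelerated versions evaluate this once per work-item):

    // Black-Scholes European call price. norm_cdf is the standard normal
    // cumulative distribution function, via the C99 erfc().
    #include <math.h>

    static double norm_cdf(double x) {
        return 0.5 * erfc(-x / sqrt(2.0));
    }

    // S: spot, K: strike, r: risk-free rate, sigma: volatility, T: years
    double bs_call(double S, double K, double r, double sigma, double T) {
        double d1 = (log(S / K) + (r + 0.5 * sigma * sigma) * T)
                    / (sigma * sqrt(T));
        double d2 = d1 - sigma * sqrt(T);
        return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2);
    }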

The problem is that it has not been maintained for 5 years, while there were good improvement opportunities. Even though the paper has already been written, we think the benchmark can still be of good use within computational finance. As we were seeking a way to make a demo for the financial industry that is not behind an NDA, this looked like the perfect starting point. We emailed all the authors of the library, but unfortunately did not get any reply. As the code is provided under a permissive license, we could luckily go forward.

The first version of the code will be released on Github early next month. Below we discuss some design choices and preliminary results.

Continue reading “Improving FinanceBench”

The entanglement of Bitcoins and compute-capabilities

Every now and then I read stories on Bitcoins (Wikipedia article), as GPUs are used a lot to “mine” Bitcoins. They have some extensive benchmarks, and their discussions give me insights into specific parts of accelerators like GPUs. This group is also very forward when it comes to accepting new techniques. Today something changed: they are a bank now. One of the thoughts I had about this I’d like to share with you.

If you look at various types of currencies, you see they all have various goals (trade, power, resources, energy, properties, etc.). The inequality and the differences are even more important than the amounts. Various currencies are entangled with a certain goal or resource, but nothing is strongly entangled with technology. Here is where Bitcoins come in…

Bitcoins are entangled with compute-power – a current benchmark for technological progress.

In this article I’d like to share how the tech economy and Bitcoins are entangled, seen from the perspective of computing. I left out a lot of the “rules of economy” and hope you can fill those in – the text below is just to guide you through the thought process. Disagreement is only good, as we all learn from it.

Continue reading “The entanglement of Bitcoins and compute-capabilities”

Performance can be measured as Throughput, Latency or Processor Utilisation

Getting data from one point to another can be measured in throughput and latency.

When you ask how fast code is, we might not be able to answer that question: it depends on the data and the metric.

In this article I’ll give an overview of different ways to describe speed and which metrics are used. I focus on two types of data utilisation:

  • Transfers: data movements through cables, interconnects, etc.
  • Processors: data processing, with data in and data out.

Both are important when selecting the right hardware. When we help our customers select the best hardware for their software, an important part of the advice is based on these two metrics.

Transfer utilisation: Throughput

How many bytes get processed per second, minute or hour? Often a metric of GB/s is used, but even MB/day is possible. Alternatively, items per second is used when relative speed is discussed. A related word is bandwidth, which describes the theoretical maximum instead of the actual bytes being transported.

The typical type of software is a batch process – think media processing (audio, video, images), search jobs and neural networks.

It could be that all answers are computed at the end of the batch process, or that results are delivered continuously. The throughput is the same, but the so-called latency is very different.
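A quick worked example with made-up numbers: a filter that processes 1080p RGBA frames at 30 fps moves 1920 × 1080 × 4 bytes ≈ 8.3 MB per frame, so it needs to sustain about 8.3 MB × 30 ≈ 0.25 GB/s – on the transfer path as well as through the processor.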

Transfer utilisation: Latency

What is the time between offering the data and getting the results? In other words: what is the reaction time? It is measured in units of time, often nanoseconds (ns, a billionth of a second), microseconds (μs, a millionth of a second) or milliseconds (ms, a thousandth of a second). When latency gets longer than seconds, it’s still called latency, but more often it’s called “processing time”.

This is important in streaming applications – think of applications in broadcasting and networking.

There are three causes for latency:

  1. Reaction time: hardware/software noticing there is a job
  2. Transport time: it takes time to copy data, especially when we talk GBs
  3. Process time: computing the data itself takes time

When latency is most important we use FPGAs (see this short presentation on OpenCL-on-FPGAs) or CPUs with embedded GPUs (where the total latency between context-switching from and to the GPU is a lot lower than when discrete GPUs are used).
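A sketch of how the total of these three can be measured with OpenCL profiling events; the queue is assumed to be created with CL_QUEUE_PROFILING_ENABLE, and queue, kernel and the sizes to already exist:

    // For one command, QUEUED->END covers the reaction time (queueing and
    // submission) plus the execution itself; measuring the copy commands
    // the same way gives the transport time. Timestamps are nanoseconds.
    cl_event ev;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, &ev);
    clWaitForEvents(1, &ev);
    cl_ulong t_queued, t_end;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED,
                            sizeof(t_queued), &t_queued, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(t_end), &t_end, NULL);
    double latency_ms = (t_end - t_queued) * 1e-6;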

Processor utilisation: Throughput

Given the current algorithm, how much potential is left on the given hardware?

The algorithm running on the processor is possibly the bottleneck of the system. The metric we use for this balance is “FLOPS per byte”: the less data that is needed per compute operation, the higher the chance that the algorithm is compute-limited. FYI: unless your algorithm is very inefficient, you should be very happy when you’re compute-limited.


The image below shows how such algorithms map onto the roofline model. You can see that for many processors you need at least 4 FLOPS per byte to hit the frequency wall; otherwise you’ll hit the bandwidth wall.

[Figure: the roofline model]

This is why HBM is so important.
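To make the roofline concrete with hypothetical numbers: a processor with 1 TFLOPS peak compute and 250 GB/s memory bandwidth attains min(peak FLOPS, FLOPS-per-byte × bandwidth), so its ridge point lies at 1000 / 250 = 4 FLOPS per byte – below that intensity a kernel is bandwidth-limited, above it compute-limited.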

Processor utilisation: Latency

How fast can data get in and out of the processor? This sets the minimum latency that can be reached. The metric is the same as for transfers (time), but now on system level.

For FPGAs this latency can be very low (tens of nanoseconds) when the data cables are directly connected to the FPGA chip. Such FPGAs sit on a board with, for example, a network port and/or a DisplayPort.

GPUs depend on how well they’re connected to the CPU. As this is a subject of its own, I’ll discuss it in another post.

Determining the theoretical speed of a system

A request like “make this software as fast as possible” is a lot easier (and cheaper) to fulfil than “make this software as fast as possible on hardware X”. This is because there is no single fastest hardware (even though vendors would make us believe so); there is only hardware that is most optimal for a specific algorithm.

When doing code reviews, we offer free advice on which hardware is best for the target algorithm, for the given budget and required power envelope. Contact us today to access our knowledge.

Computer Vision

Computing demands in computer vision are high, and real-time processing with low latency is often desirable. Computer vision can greatly benefit from parallelization: higher processing speeds can improve object-recognition rates, while FPGA solutions may reduce energy demands or support the perception of lag-free processing. At StreamHPC, we have supported several customers in optimizing their software to run on a lower power budget and at a higher speed. We can support you with dedicated solutions based on GPUs or FPGAs to meet your demands.

Building a 150 TFLOPS cluster with Accelerators in 2014

You can’t ignore accelerators anymore when designing a new cluster for HPC. Back in 2010 I suggested using GPUs to enter the Top 500 with a budget of only €38k. It takes ten times more now, as almost everybody has started to use accelerators. Getting into the November Top 500 would roughly take a cluster of 150 TFLOPS.

I’d like to give you a list of what you can expect for 2014, and to help you design your HPC cluster with recent hardware. The focus should be on OpenCL-capable hardware, as open standards prepare you better for future upgrades. So this is also a guess at what we will see in the November Top 500, based on current information.

There are currently professional solutions from NVIDIA, AMD, Intel and Altera. I’ve searched the web and asked around for what the upcoming offers would be. You will find the results below. Information should continue to flow, though; please add your remarks in the comments, so we get the best information through collaboration.

Comparison: only the double-precision GFLOPS of the accelerators are mentioned. The theoretical GFLOPS cannot be reached in real-world benchmarks; therefore, DGEMM is used as an indication of the maximum realistic GFLOPS. The efficiencies of other benchmarks (like Linpack) are all lower.

NVIDIA Tesla

NVIDIA Tesla is the current market leader with the Tesla K20 and K20X. By the end of 2013 they announced the K40 (GK110b architecture), which is 10% to 20% faster than the K20X (see table): 10% faster in max GFLOPS, and up to another 10% due to architecture improvements. It’s not a huge difference, but the new Maxwell architecture is more promising. The problem is that high-end Maxwell is not expected this year. There are several rumours about what’s going on, but the official one is that there are problems with 20nm. I’ve had this confirmed by different sources and will, of course, keep you up to date on Twitter.

I could not find good enough information on the K40X. It has also been very quiet around the current architectures at their yearly GTC conference. My expectation is that they want to kick in hard with Maxwell in 2015, and that for 2014 they’ll focus on keeping their current customers happy in a different way. For now, let’s assume the K40X is 10% faster.

So, for this year it will be the K40. Here’s an overview:

  • Peak 1.43 DP TFLOPS theoretical
  • Peak 1.33 DP TFLOPS DGEMM (93% efficiency)
  • 5.65 GFLOPS/Watt DGEMM
  • Needs 122 GPUs to get 150 TFLOPS DGEMM
  • Lowest streetprice is $4800. $585,600 for 122 GPUs.

AMD FirePro

Just like the Tesla K40 and the Intel Xeon Phi, AMD offers accelerators with a lot of memory. The S10000 and S9000 are their current server offers, but these are still based on their older architectures. Their latest architecture is only available for gamers (i.e. the R9 290X) and workstations (i.e. the W9100). With the recent announcement of the W9100, we have an indication of what this server accelerator would cost and look like. I expect this card to launch soon – I even expected it to be launched before the W9100.

What is interesting about the W9100 is the high memory-transfer rate and the large memory. Assuming they need to pack the S9150 in 225 Watt and don’t change the design much in order to launch soon, they need to under-clock it by about 22%. I think they could use 235 Watt (like the K40); nevertheless, I want to be realistic.

|  | FirePro W9100 | FirePro W9000 | FirePro S9150 |
|---|---|---|---|
| Shader count | 2816 | 2048 | 2816 |
| Memory size | 16 GByte | 6 GByte | 16 GByte |
| Memory type | GDDR5 | GDDR5 | GDDR5 |
| Interface | 512 bit | 384 bit | 512 bit |
| Transfer rate | 320 GByte/s | 264 GByte/s | 320 GByte/s |
| TDP | 275 Watt | 274 Watt | 225 Watt (-22%) |
| Connectors | 6 × MiniDP, 3D-Stereo, Frame-/Genlock | 6 × MiniDP, 3D-Stereo, Frame-/Genlock | ? |
| Multi-monitor | yes (6) | yes (6) | don’t care |
| SP/DP (TFLOPS) | 5.24 / 2.62 | 3.99 / 1.0 | 4.1 / 2.0 (-22%) |
| ECC | yes | yes | yes |
| OpenCL 2.0 | yes | no | yes |
| Price | $3999 | $2999 | ? |

So, what about the S9150, the successor of the FirePro S9000 with the latest GCN architecture? An overview:

  • Peak 2.0 DP TFLOPS theoretical
  • Peak 1.6 DP TFLOPS DGEMM (at 80% efficiency, to be safe)
  • 7.1 GFLOPS/Watt DGEMM
  • Needs 94 GPUs to get 150 TFLOPS DGEMM
  • No prices available yet – AMD mostly prices lower than NVIDIA. $371,907 for 93 GPUs, when priced at $3999.

Update: a DGEMM efficiency of 90% has been reached. That gives 1.8 DP TFLOPS DGEMM and 8.3 GFLOPS/Watt DGEMM; as a result, you need only 84 GPUs to get to the 150 TFLOPS.

Intel Xeon Phi

Intel currently offers the 3110, 5110 and 7110 Xeon Phis, and in the past months they added the 3120, 5120 and 7120. The 7120 uses 300 Watt, which needs special casing to cool this passively cooled card – I don’t quite understand this choice. I could compare it to the W9100 and a heavily overclocked K40, or use lower numbers as I did above with the FirePro. But, as you can see, it doesn’t even compare well at 300 Watt.

The OpenCL drivers have been improved this year, which is more promising news. The guess here is whether they will launch a new 7130, a 7200 or none at all. All the news and rumours speak of 2015 and 2016, with more integrated memory and a socket version(!) of the Xeon Phi.

For this year the Xeon Phi 7120 is their top offer. It compares well with AMD’s W9100 when it comes to memory: 16GB GDDR5 and 352 GB/s.

  • Peak 1.21 DP TFLOPS theoretical
  • Peak 1.07 DP TFLOPS DGEMM (at 80% efficiency)
  • 3.56 GFLOPS/Watt DGEMM
  • Needs 140 Phi’s to get 150 TFLOPS DGEMM
  • Costs $4129 officially, $578,060 for 140.

Altera FPGAs

With OpenCL it has finally become possible to run SIMD-focused software on FPGAs. OpenCL 2.0 also has some improvements for FPGAs, making them interesting for mature software that needs low latency or low power usage. In other words: software that has been designed on GPUs, where measurements show that lower latency would out-compete the GPU-using competition on the market, or where the electricity bill makes the CFO sad. Understand that FPGAs do compete with the above three, but they have their own performance hot spots, and therefore a comparison is hard.

I don’t expect a big entry in this year’s Top 500, but I’m watching FPGA progress closely. Xilinx is also entering this market, but I don’t get much response (if any) to the emails I send them. For next year’s article I hope to include FPGAs as a true competitor. If you need low power or low latency, you’d better take the time this year to research the FPGA potential for your business.

Conclusion

Open standards

For those who don’t know: I tend to prefer open standards. The main reason is that switching hardware is easier; it gives you space to experiment. AMD, Intel and Altera support OpenCL 1.2 and will start later this year with 2.0, whereas NVIDIA lags more than 2 years behind and only supports OpenCL 1.1. The results are now very visible: due to the problems with Maxwell, you’ll need to postpone your plans to 2015 if you code in CUDA. There is one way to pressure them, though: port your code to OpenCL, buy Intel or AMD hardware, and then let NVidia know you want this flexibility.

Green 500

You might have noticed the big differences in GFLOPS/Watt. Where this is important is in the Green 500, the list of energy-efficient supercomputers. The goal of today’s supercomputers is to be mentioned in the top 10 of both lists. If you build an efficient cluster (say 2 CPUs + 4 GPUs), you can get to 70-80% of max DGEMM performance. Below is the list at 75%:

  • AMD FirePro – 7.10 GFLOPS/Watt DGEMM -> 5.33 GFLOPS/Watt @ 75%
  • NVIDIA Tesla – 5.65 GFLOPS/Watt DGEMM -> 4.24 GFLOPS/Watt @ 75%
  • Intel XeonPhi – 3.56 GFLOPS/Watt DGEMM -> 2.67 GFLOPS/Watt @ 75%

Currently this list is led by a cluster with K20X GPUs, steaming out 4.50 GFLOPS/Watt, which is even 86% of max DGEMM.

In other words: if the FirePro gets out in time, the Green 500 could be full of FirePro GPUs.

Update November 2014: here is the Green top 5.

Green500 with the AMD FirePro S9150 at spot #1

The winner

Since there are only three offers, they are all winners. What matters is the order.

  1. AMD FirePro – with 16GB of fast memory, the clear winner in DGEMM performance. The negative side: CUDA software needs to be ported to OpenCL (we can do that for you).
  2. NVIDIA Tesla – second to the FirePro in everything (bandwidth, memory size, GFLOPS, price). The negative side: its OpenCL support is outdated.
  3. Intel XeonPhi – same as the FirePro when it comes to memory; nevertheless, it’s 60% slower in DGEMM and 50% less efficient. The negative side: 300 Watt in a server.

I am happy to see AMD as a clear winner after years of NVIDIA leading the pack. As AMD is the most prominent supporter of OpenCL, this could seriously democratise HPC in times to come.

[bordered_box border_color='' background_color='#C1DAD6']

Need to port CUDA to extremely fast OpenCL? Hire us!

If you order a cluster from AMD instead of NVIDIA, you effectively get our services for free.

[/bordered_box]

Image Processing

At StreamHPC, there is broad experience in the parallel, high-performance implementation of image filters. We have significantly improved the performance of various image processing software. For example, we have supported Pixelmator in achieving outstanding processing speeds on large image data, and users frequently praise the software’s speed in comparisons with competing software products.

StreamHPC is currently hosting an educational initiative that supports interested individuals in their efforts to port algorithms from the open-source GEGL image-processing framework to fast parallel versions based on OpenCL. GEGL is used by the popular image-manipulation software Gimp, as well as by other free software. For more information on this project, have a look at our website OpenCL.org, which we dedicate to spreading knowledge on OpenCL.

PDFs of Monday 5 September

Live from le Centre Pompidou in Paris: Monday PDF-day. I had never been inside the building; it is a large public library where people queue to get in – no end to the knowledge economy in Paris. A great place to read some interesting articles on the subjects I like.

CUDA-accelerated genetic feedforward-ANN training for data mining (Catalin Patulea, Robert Peace and James Green). Since I have some background in neural networks, I really liked this article.

Self-proclaimed State-of-the-art in Heterogeneous Computing (Andre R. Brodtkorb, Christopher Dyken, Trond R. Hagen, Jon M. Hjelmervik, and Olaf O. Storaasli). It is from 2010, but it just got thrown onto the net. I think it is a must-read on Cell, GPU and FPGA architectures, even though (as also remarked by others) Cell is not so state-of-the-art any more.

OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems (John E. Stone, David Gohara, and Guochun Shi). A basic and clear introduction to my favourite parallel programming language.

Research proposal: Heterogeneity and Reconfigurability as Key Enablers for Energy Efficient Computing. About increasing energy efficiency with GPUs and FPGAs.

Design and Performance of the OP2 Library for Unstructured Mesh Applications. CoreGRID presentation/workshop on OP2, an open-source parallel library for unstructured grid computations.

Design Exploration of Quadrature Methods in Option Pricing (Anson H. T. Tse, David Thomas, and Wayne Luk). Accelerating specific option pricing with CUDA. Conclusion: the FPGA needs the fewest Watts per FLOPS, CUDA is the fastest, and the CPU is the big loser in this comparison. It must be mentioned that GPUs are easier to program than FPGAs.

Technologies for the future HPC systems. Presentation on how HPC company Bull sees the (near) future.

Accelerating Protein Sequence Search in a Heterogeneous Computing System (Shucai Xiao, Heshan Lin, and Wu-chun Feng). Accelerating the Basic Local Alignment Search Tool (BLAST) on GPUs.

PTask: Operating System Abstractions To Manage GPUs as Compute Devices (Christopher J. Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray, and Emmett Witchel). MS research on how to abstract GPUs as compute devices. Implemented on Windows 7 and Linux, but code is not available.

PhD thesis by Celina Berg: Building a Foundation for the Future of Software Practices within the Multi-Core Domain. It is about a Rupture-model described at Ch.2.2.2 (PDF-page 59). [total 205 pages].

Workload Balancing on Heterogeneous Systems: A Case Study of Sparse Grid Interpolation (Alin Murarasu, Josef Weidendorfer, and Arndt Bodes). In my opinion a very important subject, as this can help automate much-needed “hardware-fitting”.

Fraunhofer: Efficient AMG on Heterogeneous Systems (Jiri Kraus and Malte Förster). AMG stands for Algebraic MultiGrid method. Paper includes OpenCL and CUDA benchmarks for NVidia hardware.

Enabling Traceability in MDE to Improve Performance of GPU Applications (Antonio Wendell de O. Rodrigues, Vincent Aranega, Anne Etien, Frédéric Guyomarc’h, Jean-Luc Dekeyser). Ongoing work on OpenCL code generation from UML (Model Driven Engineering). [34-page PDF]

GPU-Accelerated DNA Distance Matrix Computation (Zhi Ying, Xinhua Lin, Simon Chong-Wee See and Minglu Li). DNA sequences distance computation: bit.ly/n8dMis [PDF] #OpenCL #GPGPU #Biology

And while browsing around for PDFs I found the following interesting links:

  • Say bye to Von Neumann. Or how IBM’s Cognitive Computer Works.
  • Workshop on HPC and Free Software. 5-7 October 2011, Ourense, Spain. Info via j.anhel@uvigo.es
  • Basic CUDA course, 10 October, Delft, Netherlands, €200,-.
  • Par4All: automatic parallelizing and optimizing compiler for C and Fortran sequential programs.
  • LAMA: Library for Accelerated Math Applications for C/C++.