The most noticeable processors from NVIDIA, AMD and Intel

Posted by Vincent Hindriksen on 6 May 2016 with 1 Comment

AMD-Intel-NVidia 10 years ago we had CPUs from Intel and AMD and GPUs from ATI and NVidia. There was even another CPU-makers VIA, and GPU-makers S3 and Matrox. Things are different now. Below I want to shortly discuss the most noticeable processors from each of the big three.

The reason for this blog-post is that many processors are relatively unknown, and several problems are therefore solved inefficiently.

NVidia

As NVidia doesn’t have X86, they mostly focuses on GPUs and bet on POWER and ARM for CPU. They already sell their Pascal-architecture in small numbers.

2017 will all be about their Pascal-architecture.

Tesla K80 (Kepler)

The GPU is not simply 2 x K40 (GK110B GPUs), the chip is actually different (GK210)
It is the Nvidia GPU with the largest private memory size (used in kernels): 255.

This is the GPU for lazy programmers and for actually complex code: kernels can use double the registers.

Pascal P100 (Pascal)

20 TFLOPS Half Precision (HP), 10 TFLOPS single precision, 5 TFLOPS double precision
16 GB HBM2 (720 GB/s).
NVlink up to 64 GB/s effectively (20% of the 80 GB/s is protocol-overhead), dual simplex bidirectional (so dedicated wires per direction). Each NVLink offers a bidirectional 16 GB/sec up and 16 GB/sec down. Compared to 12 GB/s PCIe3 x16 (24 GB/s cumulative), this is a good speed-up. The support is only available between Pascal-GPUs, and not between the GPU and CPU yet.
OpenPOWER support coming, to compete with Intel.

Now only available in a $129.000 costing server with 8 of these (making the price of each P100 $15.000). It will probably be widely available somewhere in Q1 2017, when HBM2 production is up-to-speed. It is unknown what the price will be then – that depends on how many companies are willing to pay the high price now.

The GPU is perfect for deep learning, which NVidia is highly focused on. The 5 TFLOPS double precision is also very interesting too. A server with 8 GPUs gives you 80 TFLOPS – double that, if you only need Half Precision.

Titan Black (Kepler) and GTX 980 (Maxwell)

The Titan Black has 1.7 TFLOPS DP, 4.5 TFLOPS SP.
The GTX 980 has 0.14 TFLOPS DP, 4.6 TFLOPS SP.

The two best-sold GPUs from NVidia, which are not server-grade. What interesting to note is that the GTX 980 is not always faster than the Titan Black, even though it’s more recent.

Tegra X1

0.5 TFLOPS SP (GPU), 1 TFLOPS HP
10 Watts

While not well-accepted in the car industry (uses too much power and no OpenCL), they are well-accepted in the car-entertainment industry.

AMD

Known for the strongest OpenCL-developers since 2012. With HSA-capable Fiji-GPUs, they now got to their third GPGPU-architecture after “VLIW” and “GCN” – fully driven by their HSA-initiative.

For 2017 they focus on their main advantages: brute Single Precision performance, HBM (they have early access), their new CPU (Zen) and new GPU (Polaris).

FirePro S9170 (GCN)

32GB GDDR5 global memory
2.5 TFLOPS DP, 5 TFLOPS SP

The GPU’s processor is the same as the FirePro S9150, which has been the unknown best DP-performer of the past years. The GPU got the top 1 spot using air-cooled solutions, only to be surpassed by oil-submersed solutions. The S9170 builds on top of this and adds an extra 16GB of memory.

The S9170 is the GPU with the largest amount of memory, solving problems that use a lot of memory and are bandwidth limited – think calculations on oil&gas and weather, which now don’t fit on GPUs.

Radeon Nano and FirePro S9300X2 (Fiji)

Nano: 0.8 TFLOPS DP, 8 TFLOPS SP, no HP-support at the processor (only for data-transfers)
S9300X2: 1.4 TFLOPS DP, 13.9 TFLOPS SP (lower clocked)
Nano 175 Watt, S9300X2 300 Watt
Nano has 4 GB HBM, with a bandwidth up to 512GB/s, S9300X2 has 2x 4GB HBM.

The Nano is the answer to NVidia’s Titans, and the S9300X2 is its server-class version.

These GPUs brings the best SP-GFLOPS/€ and the best SP-GFLOPS/Watt as of now. The nano focuses on VR desktops, whereas the S9300X2 enables you to put up to 111 TFLOPS in one server.

AMD Carrizo A10 8890k APU (HSA)

CPU with built-in GPU
About one TFLOPS
TDP of 95 Watt

The fastest HSA-capable processor out there. This means that complex software that needs a mix of task-parallel and data-parallel software runs best on such processor. This CPU+GPU has the most TFLOPS available on the market.

Intel

After years of “Peter and the wolf” stories, they seem to finally have gotten the Larrabee they promised years ago. With the acquisition of Altera, new processors are at the horizon.

Their focus is still on customers who focus on test-driven design and want to “make it run quickly, make it perform later”.

Xeon E5-2699 v4

55MB cache, 22 cores
AVX 2.0 (256 bit vector operations)
DDR4 (60 GB/s)

Not well-known, but this CPU is very capable to run complex HPC-code for the price of an high-end GPU. It could reach about 0.64 GFLOPS DP peak, when fully using all cores and AVX 2.0.

XeonPhi Knights landing

Available in socket and PCI version
3 TFLOPS DP, 6 TFLOPS SP
AVX 512 (512 bit vector operations)
16 GB HBM (over 400GB/s), up to 348 GB DDR4 (60 GB/s).
Currently (?) not programmable with OpenCL

After years of okish XeonPhis, it seems Intel now has a processor that competes with AMD and NVidia. Existing code (almost) just works on this processor, and can then be improved step-by-step. The only think not to be liked is the lack of benchmarks – so above numbers are all on paper.

Xeon+FPGA

Task-parallel processor
Low-latency

The reconfigurable chip that has been promised for over 2 decades.

I’m still researching this upcoming processor, as one of the strengths of an FPGA is the low-latency links to DisplayPort and networking, which seem to go via PCI on this processor.

Iris GPUs

CPU with built-in GPU
0.7 TFLOPS SP

As these GPUs are included in almost all CPU that Intel sells, these are the most-sold GPUs.

Selecting the right hardware

Choosing the best hardware has become quite complex, especially when focusing on the TCO (Total Costs of Ownership). At StreamHPC we have experience with many of the devices above, but also various embedded hardware that compete with the above processors on a totally different scale. You need to select the right benchmarks to know what your device of choice is – we can help with that.

9 questions on OpenCL’s future answered at IWOCL

Posted by Vincent Hindriksen on 21 April 2016 with 1 Comment

During the panel discussion some very interesting questions were asked, I’d like to share with you.

Should the Khronos group poll the community more often about the future of OpenCL?

I asked it on twitter, and this is the current result: khronos-community-feedback

Khronos needs more feedback from OpenCL developers, to better serve the user base. Tell the OpenCL working group what holds you back in solving your specific problems here. Want more influence? You can also join the OpenCL advisory board, or join Khronos with your company. Get in contact with Neil Trevett for more information.

How to (further) popularise OpenCL?

While the open standard is popular at IWOCL, it is not popular enough at universities. NVidia puts a lot of effort in convincing academics that OpenCL is not as good as CUDA and to keep CUDA as the only GPGPU API in the curriculum.

Altera: “OpenCL is important to be thought at universities, because of the low-level parts, it creates better programmers”. And I agree, too many freshly graduated CS students don’t understand malloc() and say “The compiler should solve this for me”.

The short answer is: more marketing.

At StreamHPC we have been supporting OpenCL with marketing (via this blog) since 2010. 6 years already. We are now developing the website opencl.org to continue the effort, while we have diversified at the company.

How to get all vendors to OpenCL 2.0?

Ofcourse this was a question targeted at NVidia, and thus Neil Trevett answered this one. Use a carrot and not a stick, as it is business in the end.

Think more marketing and more apps. We already have a big list: opencl-library-ecosphere

Know more active(!) projects? Share!

Can we break the backwards compatibility to advance faster?

This was a question from the panel to the audience. From what I sensed, the audience and panel are quite open to this. This would mean that OpenCL could make a big step forward, fixing the initial problems. Deprecation would be the way to go the panel said. (OpenCL 2.3 deprecates all nuisances and OpenCL 3.0 is a redesign? Or will it take longer?)

See also the question below on better serving FPGAs and DSPs.

Should we do a specs freeze and harden the implementations?

Michael Wong (OpenMP) was clear on this. Learn from C++98. Two years were focused on hardening the implementations. After that it took 11 years to restart the innovation process and get to C++11! So don’t do a specs freeze.

How to evolve OpenCL faster?

Vendor extensions are the only option.

At StreamHPC we have discussed a lot about it, especially fall-backs. In most cases it is very doable to create slower fall-backs, and in other cases (like with special features on i.e. FPGAs) it can be the only option to make it work.

How to get more robust OpenCL implementations?

Open sourcing the Vulkan conformance tests was a very good decision to make Vulkan more robust. Khronos gets a lot of feedback on the test cases. It will be discussed soon to what extend this also can be done for OpenCL.

Test-cases from open source libraries are often used to create more test cases.

How to better support FPGAs and DSPs?

Now GPUs are the majority and democracy doesn’t work for the minorities.

An option to better support FPGAs and DSPs in OpenCL is to introduce feature sets. A lesson learnt from Vulkan. This way GPU vendors don’t need to spend time implementing features that they don’t find interesting.

Do we see you at IWOCL 2017?

Location will be announced later. Boston and Toronto are mentioned.

Comparing Syntax for CUDA, OpenCL and HiP

Posted by Vincent Hindriksen on 5 April 2016 with 1 Comment

Both CUDA and OpenCL are well-known GPGPU-languages. Unfortunately there are some slight differences between the languages, which are shown below.

You might have heard of HiP, the language that AMD made to support both modern AMD Fiji GPUs and CUDA-devices. CUDA can be (mostly automatically) translated to HiP and from that moment your code also supports AMD high-end devices.

To give an overview how HiP compares to other APIs, Ben Sanders made an overview. Below you’ll find the table for CUDA, OpenCL and HiP, slightly altered to be more complete. The languages HC and C++AMP can be found in the original.

Term	CUDA	OpenCL	HiP
Device	int deviceId	cl_device	int deviceId
Queue	cudaStream_t	cl_command_queue	hipStream_t
Event	cudaEvent_t	cl_event	hipEvent_t
Memory	void *	cl_mem	void *
Grid of threads	grid	NDRange	grid
Subgroup of threads	block	work-group	block
Thread	thread	work-item	thread
Scheduled execution	warp	sub-group (warp, wavefront, etc)	warp
Thread-index	threadIdx.x	get_local_id(0)	hipThreadIdx_x
Block-index	blockIdx.x	get_group_id(0)	hipBlockIdx_x
Block-dim	blockDim.x	get_local_size(0)	hipBlockDim_x
Grid-dim	gridDim.x	get_global_size(0)	hipGridDim_x
Device Kernel	__global__	__kernel	__global__
Device Function	__device__	N/A. Implied in device compilation	__device__
Host Function	__host_ (default)	N/A. Implied in host compilation.	__host_ (default)
Host + Device Function	__host____device__	N/A.	__host____device__
Kernel Launch	<<< >>>	clEnqueueNDRangeKernel	hipLaunchKernel
Global Memory	__global__	__global	__global__
Group Memory	__shared__	__local	__shared__
Private Memory	(default)	__private	(default)
Constant	__constant__	__constant	__constant__
Thread Synchronisation	__syncthreads	barrier(CLK_LOCAL_MEMFENCE)	__syncthreads
Atomic Builtins	atomicAdd	atomic_add	atomicAdd
Precise Math	cos(f)	cos(f)	cos(f)
Fast Math	__cos(f)	native_cos(f)	__cos(f)
Vector	float4	float4	float4

You see that HiP borrowed from CUDA.

The discussion is ofcourse if all alike APIs shouldn’t use the same wordings. A best thing would be to mix for the best, as CUDA’s “shared” is much more clearer than OpenCL’s “local”. OpenCL’s functions on locations and dimensions (get_global_id(0) and such) on the other had, are often more appreciated than what CUDA offers. CUDA’s “<<< >>>” breaks all C/C++ compilers, making it very hard to make a frontend of IDE-plugin.

I hope you found the above useful to better understand the differences between CUDA and OpenCL, but also to see how HiP comes into the picture.

Meet us in April

Posted by Vincent Hindriksen on 4 April 2016

9017503_m The coming month we’re travelling every week. This generates are a lot of opportunities where you can meet the StreamHPC team! For appointments, send an email to contact@streamhpc.com.

Meet us at ParallelCon (6 April 2016, Heidelberg, Germany). Besides the crash course (see below), we also have a talk on Vulkan.
Crash Course OpenCL @ ParallelCon (8 April 2016, Heidelberg, Germany). This is part of the conference – you can still buy tickets!
Meet us in Toronto (11 April 2016, Toronto, Canada). In Toronto for business, with time for appointments.
Meet us at IWOCL (19 April 2016, Vienna, Austria). The event-of-the-year for all OpenCL. So ofcourse we’re there.
Meet us in Grenoble (25 April 2016, Grenoble, France). For a training we’re there the whole week. On Thursday and Friday there is time for appointments.

We’re happy to talk business and about technology. Also giving presentations at your company is an option.

This information was previously communicated via the newsletter and on LinkedIn.

AMD’s infographic on HBM

Posted by Vincent Hindriksen on 30 March 2016 with 1 Comment

If one company is bad at bragging, then it’s AMD. It’s two main competitors are a lot better in that – NVIDIA even bragged about their upcoming GPUs having HBM. So I was surprised that recently I encountered a nice infographic, where AMD was actually bragging. And they deserved to do it!

I wanted to have comparisons with Intel/Micron’s HBC, but I leave that for another post as the good information is often a year old.

Very close to the processor

It’s using a high-speed bus on the substrate.

HBM-package

And yes, it really matters to be closer to the processor.

HBM-stacked

HBM versus GDDR5

Bus width from 32-bit tot 1024-bit
Clockspeed down. We need to wait for how it’s calculated.
Bandwidth up a lot. We can expect 1TB/s for GPUs now
Required voltage 14% down, which saves a lot of energy

HBM-vs-GDDR5

Better GB/s per Watt

So for maintaining 320GB/s you would need 30 Watts. Now you need 9 Watts. As the reduction in power for the Radeon NANO is almost 100W, you understand that this tells only part of the power-reductions made possible.

HBM-Watt

A lot smaller

Yes, 94% less surface area. Only part of the reason is the stacking.

HBM-size

Standards AMD has pioneered

HBM has been engineered and implemented by AMD, made a standard by JEDEC and put into sylicon by Hynix.

AMD-standards

And finally some bragging! AMD has made many standards we use daily, but never knew it was AMD technology.

Mantle. The predecessor of Vulkan, DirectX 12 and more
GDDR 1 to 5. Now being replaced by HBM, and not GDDR6
Wake-on-LAN. You never knew! Intel and IBM made it into a standard, but AMD introduced the Magic Packet in 1995.
DisplayPort Adaptive-Sync. Previously known as FreeSync.
X86-64. The reason why you find “amd64” packages in Linux.
Integrated Memory Controllers.
On-die GPUs.
Consumer Multicore CPU, the Athlon 62 X2.
HSA. Not in the list, probably because it’s a recent advancement.

Want to see the full infographic? Click here.

An example of real-world, end-user OpenCL usage

Posted by Vincent Hindriksen on 28 March 2016

We ask all our customers if we could use their story on our webpage. For competition reasons, this is often not possible. The people of CEWE Stiftung & Co. KGaA were so kind to share his experience since he did a OpenCL training with us and we reviewed his code.

Enjoy his story on his experience from the training till now!

This year, the CEWE is planning to implement some program code of the CEWE Photoworld in OpenCL. This software is used for the creation and purchase of photo products such as the CEWE Photobook, CEWE Calendars, greeting cards and other products with an installation base of about 10 million throughout Europe. It is written in Qt and works on Windows, Mac and Linux.

In the next version, CEWE plans to improve the speed of image effects such as the oil painting filter, to become more useful in the world of photo manipulation. Customers like to some imaging effects to improve photo products, to get even more individual results, fix accidentally badly focused photos and so on.

Continue reading “An example of real-world, end-user OpenCL usage” →

Random Numbers in Parallel Computing: Generation and Reproducibility (Part 1)

Posted by Michael Noisternig on 18 March 2016 with 2 Comments

random_300 Random numbers are important elements in stochastic simulations, but they also show up in machine learning and applications of Monte Carlo methods such as within computational finances, fluid dynamics and molecular dynamics. These are classical fields in high-performance computing, which StreamHPC has experience in.

A common problem when porting traditional software in these fields to parallel computing environments is the generation and reproducibility of random numbers. Questions that arise are:

Performance: How can we efficiently obtain random numbers when they are classically generated in a serial fashion?
Quality: How can we make sure that random numbers generated in a parallel environment still fulfil statistical randomness requirements?
Verification: How can we be sure that the parallel implementation is correct?

We consider verification from the viewpoint of producing identical results among different software implementations. This is often an important matter for our customers, and we have given them guidance on how to address this issue when random numbers are involved.

In this first part of our two-part blog series, we will briefly address some common pitfalls in the generation of random numbers in parallel environments and suggest suitable random-number generation libraries for OpenCL and CUDA. In the second part – on the blog soon – we will discuss solutions for reproducibility in the presence of random numbers.

Generation

Random numbers in computer software are typically obtained via a deterministic pseudo-random number generator (PRNG) algorithm. The output of such an algorithm is not truly random but pseudo-random (i.e., it appears statistically random), though we will simply say “random” for simplicity. We do not consider truly random numbers, which may be derived from physical phenomena such as radioactive decay, because we want the output of a random number generator to be reproducible.

PRNGs traditionally offered to application developers fail within the parallel setting. One reason is that these algorithms usually only support the sequential generation of random numbers based on some initial (seed) value (e.g., consider the standard C rand() function), so work items on a parallel device would need to block for getting exclusive access to the generator, which clearly impacts efficiency.

Some applications may require only a moderate amount of random numbers. In this case, we found it feasible to precompute the required set of random numbers and hold them in global memory. We call this the table-based approach. Other applications in turn may need to efficiently create a huge amount of random numbers. In this case, it may be necessary to equip each work item with its own PRNG seed. One potential problem with this approach is the use of weak PRNGs such as linear congruential generators (LCGs), which remain popular due to their speed and simplicity. In parallel settings, correlations between output sequences are aggravated and the quality of the application output may be severely affected, so LCGs should not be used at all. Another problem is the use of a small seed or a small PRNG’s internal state space. In this case, we may expect that the probability of two work items creating the same random sequence is quite high. Indeed, if we would randomly seed via srand(), the chance is already 50% for two out of approximately 77,000 work items creating entirely the same random number sequence! So we may either need a PRNG with a larger seed space and internal state, or one with a larger state and some mechanism to subslice the PRNG’s output sequence into non-overlapping “substreams”, with one substream per work item. The Mersenne Twister is highly acclaimed but requires a memory state of approximately 2.5 KB per work item in a parallel setting, and substreams are difficult to implement. While good PRNGs with a small internal state and flexible substream support exist (e.g., MRG32k3a), there are also “index-based” PRNGs, which are often more elaborate to compute but do not maintain any state. Such state-less PRNGs take an arbitrary index and a “key” as input and return a random number corresponding to the index in its random output sequence (which depends on the key chosen). Index-based PRNGs are very useful in parallel computing environments, and we will show how we use them for reproducibility in the second part of this blog.

The choice of an appropriate PRNG may not be easy and ultimately depends on the application scenario. Luckily, there is choice! CUDA offers a set of PRNGs via its cuRAND library, and OpenCL applications can benefit from the clRNG library that AMD has released last year. Both cuRAND and clRNG offer a state-based interface with substream support. For index-based algorithms, the Random123 library provides high-quality PRNG implementations for both OpenCL and CUDA.

So far, we have discussed how we can safely generate random numbers in the GPU and FPGA context, but we cannot control the order in which parallel, concurrent work items create random numbers. This makes it difficult to verify the parallel implementation since its output may be different from that of the serial, original code. So the question is, in the presence of random numbers, how can we easily verify that our parallel code implements not only a faithful but a correct port of the serial version? This is addressed in part two – continue reading.

New: OpenCL Crash Courses

Posted by Vincent Hindriksen on 1 March 2016

To see if OpenCL is the right choice for your project, we now only ask one day of your time. This enables you to quickly assess the technology without a big investment.

Throughout Europe we give crash courses in OpenCL. After just one day you will know:

The models used to define OpenCL.
If OpenCL is an option for your project.
How to read and understand OpenCL code.
Code simple OpenCL programs.
Differences between CPUs, GPUs and FPGAs.

There are two types: GPU-oriented and FPGA-oriented. We’ve selected Altera FPGAs and AMD FirePro GPUs to do the standard trainings.

[eme_events category=4]

If you are interested to get a certain crash course in another city than currently scheduled, fill in the below form to get notified when the crash course of your city of choice.

Loading…

We will add more dates and places continuously. If you want to host an OpenCL crash course event, get in contact.

Note: crash courses are intended to get you in contact with software accelerators, so it doesn’t replace a full training.

Atomic operations for floats in OpenCL – improved

Posted by Anca Hamuraru on 9 February 2016

Looking around in our code and on our intranet, I found lots of great and unique solutions to get the most performance out. Ofcourse we cannot share all of those, but there are some code-snippets that just have to get out. Below is one of them.

In OpenCL there is only atomic_add or atomic_mul for ints, and not for floats. Unfortunately there are situations there is no other way to implement the algorithm, than with atomics and floats. Already in 2011 Igor Suhorukov shared a solution to get atomic functions for floats, using atomic_cmpxchg(). Here is his example for atomic add:

[raw] inline void AtomicAdd_g_f(volatile __global float *source, const float operand) { union { unsigned int intVal; float floatVal; } newVal; union { unsigned int intVal; float floatVal; } prevVal; do { prevVal.floatVal = *source; newVal.floatVal = prevVal.floatVal + operand; } while (atomic_cmpxchg((volatile __global unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal); } [/raw]

Unfortunately this implementation is not guaranteed to produce the correct results because OpenCL does not enforce global/local memory consistency across all work-items from all work-groups. In other words, a read from the buffer source is not guaranteed to actually perform a read from the specified global buffer; it could, for example, return the value stored in a local cache. For more details check the chapter 3.3 – “Memory Model”, subchapter “Memory Consistency” of the OpenCL specification.
To be sure that the implementation for AtomicAdd_g_f is correct we need to use the value returned by the function atomic_cmpxchg. This guarantees that the actual value stored in global memory is returned.

As it seems our improved version is hidden deep in the code of GROMACS, here’s the code you should use:

[raw] _INLINE_ void atomicAdd_g_f(volatile __global float *addr, float val) { union { unsigned int u32; float f32; } next, expected, current; current.f32 = *addr; do { expected.f32 = current.f32; next.f32 = expected.f32 + val; current.u32 = atomic_cmpxchg( (volatile __global unsigned int *)addr, expected.u32, next.u32); } while( current.u32 != expected.u32 ); } [/raw]

As was mentioned in Suhorukov’s blog post, you can change the global to local, and implement the other operations likewise:

[raw] Atomic_mul_g_f(): next.floatVal = expected.f32 * operand; Atomic_mad_g_f(source, operand1, operand2): next.f32 = mad(operand1, operand2, expected.f32); Atomic_div_g_f(): next.f32 = expected.f32 / operand; [/raw]

Enjoy!

Performance of 5 accelerators in 4 images

Posted by Vincent Hindriksen on 8 December 2015

running If there would be one rule to get the best performance, then it’s avoiding data-transfers. Therefore it’s important to have lots of bandwidth and GFLOPS per processor, and not simply add up those numbers. Everybody who has worked with MPI, knows why: transferring data between processors can totally kill the performance. So the more is packed in one chip, the better the results.

In this short article, I would like to quickly give you an overview of the current state for bandwidth and performance. You would think the current generation accelerators is very close, but actually it is not.

The devices in the below images are AMD FirePro S9150 (16GB), NVidia Tesla K80 (1 GPU of the 2, 12GB), NVidia Tesla K40 (12GB), Intel XeonPhi 7120P (16GB) and Intel Xeon 2699 v3 (18 core CPU). I doubted about selecting a K40 or K80, as I wanted to focus on a single GPU only – so I took both. Dual-GPU cards have an advantage when it comes to power-consumption and physical space – both are not taken into consideration in this blog. Neither efficiency (actual performance compared to theoretical maximum) is included, as this also needs a broad explanation.

Each of these accelerators runs on X86-OpenMP and OpenCL

The numbers

The bandwidth and performance show where things stand: The XeonPhi and FirePro have the most bandwidth, and the FirePro is a staggering 70% to 100% faster than the rest on double precision GFLOPS.

bandwidth-per-chip — Xeon Phi gets to 350 GB/s, followed by the FirePro with 320 GB/s and K40 with 288 GB/s. NVidia’s K80 is only as 240 GB/s, where DDR gets only 50 -60 GB/s.

gflops-per-chip — The FirePro leaves the competition far behind with 2530 GFLOPS (Double Precision). The K40 and K80 get 1430 and 1450, followed by the CPU at 1324 and the Xeon Phi at 1208. Notice these are theoretical maximums and will be lower in real-world applications.

If you have OpenCL or OpenMP code, you can optimise your code for a new device in a short time. Yes, you should have written it in OpenCL or openMP, as now the competition can easily outperform you by selecting a better device.

Costs

Lowest prices in the Netherlands, at the moment of writing:

Intel Xeon 2699 v3: € 6,560.
Intel Xeon Phi 7120P + 16GB DDR4: € 3,350
NVidia Tesla K80: € 5,500 (€ 2,750 per GPU)
NVidia Tesla K40: € 4,070
AMD FirePro S9150: € 3,500

Some prices (like the K40) have one shop with a low price, where others are at least €200 more expensive.

Note: as the Xeon can have 1TB of memory, the “costs per GB/s” is only half the story. Currently the accelerators only have 16GB. Soon a 32GB FirePro will be available in the shops, the S9170, to move up in this space of memory hungry HPC applications.

costs-per-gflops-per-chip — For raw GFLOPS the FirePro is the cheapest, followed by the K80, XeonPhi and then the K40. While the XeonPhi and K40 are twice as expensive as the FirePro, the Xeon is clearly the most expensive as it is 3.5 times as expensive as the FirePro.

If costs are an issue, then it really makes sense to invest some time in making your own Excel sheets for several devices and include costs for power usage.

Which to choose?

Based on the above numbers, the FirePro is the best choice. But your algorithm might simply work better on one of the others – we can help you by optimising your code and performing meaningful benchmarks.

PHD position at university of Newcastle

Posted by Vincent Hindriksen on 18 November 2015

At the university of Newcastle they use OpenCL for researching the performance balance between software and hardware. This resource management isn’t limited to shared memory systems, but extends to mixed architectures where batches of co-processors and other resources make it much more complex problem to solve. They chose OpenCL as it gives both inter-node and intra-node resource-management.

Currently they offer a PhD position and seek the brilliant mind that can solve the heterogeneous puzzle like a chess player. It is a continuation of years of research and the full description is in the PDF below.

Continue reading “PHD position at university of Newcastle” →

SC15 news from Monday

Posted by Vincent Hindriksen on 16 November 2015 with 4 Comments

Warning: below is raw material, and needs some editing.

Today there was quite some news around OpenCL, I’m afraid I can’t wait till later to have all news covered. Some news is unexpected, some is great. Let’s start with the great news, as the unexpected news needs some discussion.

Khronos released OpenCL 2.1 final specs

As of today you can download the header files and specs from https://www.khronos.org/opencl/. The biggest changes are:

C++ kernels (still separate source files, which is to be tackled by SYCL)
Subgroups are now a core functionality. This enables finer grain control of hardware threading.
New function clCloneKernel enables copying of kernel objects and state for safe implementation of copy constructors in wrapper classes. Hear all Java and .NET folks cheer?
Low-latency device timer queries for alignment of profiling data between device and host code.

OpenCL 2.1 will be supported by AMD. Intel was very loud with support when the provisional specs got released, but gave no comments today. Other vendors did not release an official statement.

Khronos released SPIR-V 1.0 final specs

SPIR-V 1.0 can represent the full capabilities of OpenCL 2.1 kernels.

This is very important! OpenCL is not the only language anymore that is seen as input for GPU-compilers. Neither is OpenCL hostcode the only API that can handle the compute shaders, as also Vulkan can do this. Lots of details still have to be seen, as not all SPIRV compilers will have full support for all OpenCL-related commands.

With the specs the following tools have been released:

A bi-directional translator between LLVM to SPIR-V to enable flexible use of both intermediate languages in tool chains.
An OpenCL C to LLVM compiler that generates SPIR-V through the above translator, as Clang can compile OpenCL 1.2/2.0 C kernels.
A SPIR-V assembler and disassembler.

SPIRV will make many frontends possible, giving co-processor powers to every programming language that exists. I will blog more about SPIRV possibilities the coming year.

Intel claims OpenMP is up to 10x faster than OpenCL

The below image appeared on Twitter, claiming that OpenMP was much faster than OpenCL. Some discussion later, we could conclude they compared apples and oranges. We’re happy to peer-review the results, putting the claims in a full perspective where MKL and operation mode is mentioned. Unfortunately they did not react, as <sarcasm>we will be very happy to admit that for the first time in history a directive language is faster than an explicit language – finally we have magic!</sarcasm>

.@IntelHPC Need a second opinion to show it is true that OpenMP is faster than OpenCL on the XeonPhi?

— StreamHPC (@StreamHPC) November 16, 2015

CT3iJBlVEAA2WCY — Left half is FFT and GEMM based, probably using Intel’s KML. All algorithms seems to be run in a different mode (native mode) when using OpenMP, for which intel did not provide OpenCL driver support for.

We get back later this week on Intel and their upcoming Xeon+FPGA chip, if OpenCL is the best language for that job. It ofcourse is possible that they try to run OpenMP on the FPGA, but then this would be big surprise. Truth is that Intel doesn’t like this powerful open standard intruding the HPC market, where they have a monopoly.

AMD claims OpenCL is hardly used in HPC

Well, this is one of those claims that they did not really think through. OpenCL is used in HPC quite a lot, but mostly on NVidia hardware. Why not just CUDA there? Well, there is demand for OpenCL for several reasons:

Avoid vendor lock-in.
Making code work on more hardware.
General interest in co-processors, not specific one brand.
Initial code is being developed on different hardware.
…

Thing is that NVidia did a superb job in getting their processors in supercomputers and clouds. So OpenCL is mostly run on NVidia hardware and a therefore the biggest reason why that company is so successful in slowing the advancement of the standard by rolling out upgrades 4 years later. Even though I tried to get the story out, NVidia is not eager to tell the beautiful love story between OpenCL and the NVidia co-processor, as the latter has CUDA as its wife.

Also at HPC sites with Intel XeonPhi gets OpenCL love. Same here: Intel prefers to tell about their OpenMP instead of OpenCL.

AMD has few HPC sites and indeed there is where OpenCL is used.

No, we’re not happy that AMD tells such things, only to promote its own new languages.

CUDA goes AMD and open source

AMD now supports CUDA! The details: they have made a tool that can compile CUDA to “HiP” – HiP is a new language without much details at the moment. Yes, I have the same questions as you are asking now.

Also Google joined in and showed progress on their open source implementation of CUDA. Phoronix is currently the best source for this initative and today they shared a story with a link to slides from Google on the project. the results are great up: “it is to 51% faster on internal end-to-end benchmarks, on par with open-source benchmarks, compile time is 8% faster on average and 2.4x faster for pathological compilations compared to NVIDIA’s official CUDA compiler (NVCC)”.

For compiling CUDA in LLVM you need three parts:

a pre-processor that works around the non-standard <<<…>>> notation and splits off the kernels.
a source-to-source compiler for the kernels.
an bridge between the CUDA API and another API, like OpenCL.

Google has done most of this and now focuses mostly on performance. The OpenCL community can use this to use this project to make a complete CUDA-to-SPIRV compiler and use the rest to improve POCL.

Khronos gets a more open homepage

Starting today you can help keeping the Khronos webpage more up-to-date. Just put a pull request at https://github.com/KhronosGroup/Khronosdotorg and wait until it gets accepted. This should help the pages be more up-to-date, as you can now improve the webpages in more ways.

More news?

AMD released HCC, a C++ language with OpenMP built-in that doesn’t compile to SPIRV.

There have been tutorials and talks on OpenCL, which I should have shared with you earlier.

Tomorrow another post with more news. If I forgot something on Sunday or Monday, I’ll add it here.

OpenCL at SC15 – the booths to go to

Posted by Vincent Hindriksen on 15 November 2015 with 1 Comment

This year we’re unfortunately not at SuperComputing 2015 for reasons you will hear later. But we haven’t forgotten about the people going and trying to find a share of OpenCL. Below is a list of companies having a booth at SC15, which was assembled by the guys of IWOCL and we completed with some more background information.

Khronos

The first place to go to is booth #285 and meet Khronos to hear where to go at SC15 to see how OpenCL has risen over the years. More info here. Say hi from the StreamHPC team!

OpenCL on FPGAs

Altera | Booth: #462. Expected to have many demos on OpenCL. See their program here. They have brought several partners around the floor, all expecting to have OpenCL demos:

Reflex | Booth: #3115.
BittWare | Booth #3010.
Nallatech | Booth #1639.
Gidel | Booth #1937.

Xilinx | Booth: #381. Expected to show their latest advancements on OpenCL. See their program here.

Microsoft | Booth: #1319. Microsoft Bing is accelerated using Altera and OpenCL. Ask them for some great technical details.

ICHEC | Booth #2822. The Irish HPC centre works together with Xilinx using OpenCL.

Embedded OpenCL

ARM | Booth: #2015. Big on 64 bit processors with several partners on the floor. Interesting to ask them about the OpenCL-driver for the CPU and their latest MALI performance.

Huawei Enterprise | #173. Recently proudly showed the world their OpenCL capable camera-phones, using ARM MALI.

HPC OpenCL

Below are the three companies that promise at least 1 TFLOPS DP per co-processor.

Intel | Booth: #1333/1533. Where they spoke about OpenMP and forgot about OpenCL, Altera has brought them back. Maybe they share some plans about Xeon+FPGA, or OpenCL support for the new XeonPhi.

AMD | Booth: #727. HBM, HSA, Green500, HPC APU, 32GB GPUs and 2.2 TFLOPS performance – enough to talk about with them. Also lots of OpenCL love.

NVidia | Booth: #1021. Every year they have been quite funny when asked about why OpenCL is badly supported. Please do ask them this question again! Funniest answer wins something from us – to be decided.

Others

You’ll find OpenCL in many other places.

ArrayFire | Booth #2229. Their library has an OpenCL backend.

IBM | Booth: #522. Now Altera joined Intel, IBM’s OpenPower has been left with NVidia for accelerators. OpenCL could revive the initiative.

NEC | Booth: #313. The NEC group has accelerated PostgreSQL with OpenCL.

Send your photos and news!

Help us complete this post with news and photos, to complete this post. We’re sorry not to be there this year, so we need your help to make the OpenCL party complete. You can send via email, twitter and in the comments below. Thanks in advance!

An OpenCL-on-FPGAs presentation in a bar

Posted by Vincent Hindriksen on 6 November 2015

What do you do when you want to explain OpenCL and FPGAs and OpenCL-on-FPGAs to a beer drinking crowd in just 15 minutes? Well, you simply can’t go deep into the matter. On a Thursday evening, 5 November 2015, I was standing on a chair for a beer-loving group of Hackers and Founders with my laser-powered presenter, trying not to loose everybody. It was not the first time I stood on that particular chair – some years ago I presented about OpenCL-on-GPUs.

Below you find the full presentation – feel free to use and change the slides for yourself (PDF here).

Do you want us to present OpenCL, accelerators or performance engineering in a talk tailored for your audience? Just give us a call.

Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup

Posted by Vincent Hindriksen on 4 November 2015

Example of modelled versus measured water activity ('effective' concentration) for highly detailed organic chemical representation based on continental studies using UNIFAC — Example of modelled versus measured water activity (‘effective’ concentration) for highly detailed organic chemical representation based on continental studies using UNIFAC

As we cannot use the performance results for most of our commercial projects because they contain sensitive data, we were happy that Dr. David Topping from the University of Manchester was so kind to allow us to share the data for the UNIFAC project. The goal for this project was simple: port the UNIFAC algorithm to the Intel XeonPhi using OpenCL. We got a total of 485x speedup: 3.0x for going from single-core to multi-core CPU, 53.9x for implementing algorithmic, low-level improvements and a new memory layout design, and 3.0x for using the XeonPhi via OpenCL. To remain fair, we used the 160x speedup from multi-core CPU in the title, not from serial code. Continue reading “Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup” →

Our “new” Bug Hunting & Removal Service

Posted by Vincent Hindriksen on 31 October 2015

We recently started a new service, which we were actually doing for years already. You can also learn from this: one can become very experienced in a task and then noticing years later that it can be a service on itself. So starting ~~years ago~~today, you can hire us to find all types of bugs – we accept bets.

Hidden bugs

There are many types of bugs that need attention before performance-concerns can even be tackled. Software that works well with a few threads and on small data-sets, can completely burn the computer when scaled up. During our existence we got very experienced as bug-hunters, as each project needed to have this phase. We now have an environment, fully tailored to support bug hunting. We now want to offer this as a separate service.

A selection of common errors we encounter:

Reading/writing outside array boundaries.
Race conditions.
Arithmetic overflow or underflow.
Arithmetic precision.
Null pointer dereference.
Using an uninitialized variable.
Resource leaks.
etc…

Often related is testability, for which we also have effective solutions for:

Randomness in software.
Undefined results, which are labelled as “don’t care”.
Unknown required precision.

We have the experience and the tools to get many of those error problems tackled and solved. Ask us for more information today to get cleaner and more robust software.

We more than halved the FPGA development time by using OpenCL

Posted by Vincent Hindriksen on 31 October 2015

Over the past year we developed and fine-tuned a project setup for FPGA development that is much faster than any other method, including other high-level languages for making FPGA-based systems.

How we did it

OpenCL makes it easy to use the CPU and GPU and their tools. Our CPU and GPU developers would design software with FPGAs in mind, after which the FPGA developer took over and finalised the project. As we have expertise in the very different phases of such project, we could be much more effective than when sticking to traditional methods.

The bonus

It also works on CPU and GPU. It has to be said, that the code hasn’t been fully optimised for CPUs and GPUs – this can be done in a separate project. In case a decision has to be made on which hardware to use, our solution has the least risk and the most answers.

Our Unique Selling Points

For the FPGA market our USPs are clear:

We outperform traditional FPGA development companies in time-to-market and price.
We can discuss problems on hardware level, software level and algorithm level. This contrasts with traditional FPGA houses, where there are less bridges.
Our software also works on CPUs and GPUs for no additional charge.
The latencies of the resulting project are very comparable.

We’re confident we can make a difference in the FPGA market. If you want more information or want to discuss, feel free to contact us.

OpenCL in the cloud – API beta launching in a month

Posted by Vincent Hindriksen on 31 October 2015

We’re starting the beta phase of our AMD FirePro based OpenCL cloud services in about a month, to test our API. If you need to have your OpenCL based service online and don’t want to pay hundreds to thousands of euros for GPU-hosting, then this is what you need. We have place for a few others.

The instances are chrooted, not virtualised. The API-calls are protected and potentially some extra calls have to be made to fully lock the GPU to your service. The connection is 100MBit duplex.

Payment is per usage, per second, per GPU and per MB of data – we will be fine-tuning the weights together with our first customers. The costs are capped, to make sure our service will remain cheaper than comparable EC2 instances.

Get in contact today, if you are interested.

The magic of clGetKernelWorkGroupInfo

Posted by Vincent Hindriksen on 22 October 2015

It’s not easy to get the available private memory size – actually it’s impossible to get this information directly from the device/drivers, using the OpenCL API. This can only be explained after you dive deep into clGetKernelWorkGroupInfo – the function that tells you how well your kernel fits on the device. It is strange this function is not often discussed.

Memory sizes

CL_KERNEL_LOCAL_MEM_SIZE

Returns the amount of local memory, in bytes, being used by a kernel (per work-group). Use CL_DEVICE_LOCAL_MEM_SIZE to find out the maximum.

CL_KERNEL_PRIVATE_MEM_SIZE

Returns the minimum amount of private memory, in bytes, used by each work-item in the kernel.

Work sizes

CL_KERNEL_GLOBAL_WORK_SIZE

This answers the question “What is the maximum value for global_work_size argument that can be given to clEnqueueNDRangeKernel?”. The result is of type size_t[3].

CL_KERNEL_WORK_GROUP_SIZE

The is the same for local_work_size. The kernel’s resource requirements (register usage etc.) are used, to determine what this work-group size should be.

CL_KERNEL_COMPILE_WORK_GROUP_SIZE

If __attribute__((reqd_work_group_size(X, Y, Z))) is used, then (X, Y, Z) is returned, else (0, 0, 0).

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

It returns a performance-hint: if the total number of work-items is a multiple of this number, then you’ll get good results. So no more remembering 32 or 64 for specific GPUs, but simply kick in a call to this function.

Combined with clDeviceInfo’s CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, you can fine-tune your workgroup-size in case you need the group-size to be as large as possible.

Call for papers: SYCL workshop, 13-March-2016, Barcelona, Spain

Posted by Vincent Hindriksen on 22 October 2015 with 2 Comments

A high-level language has been on OpenCL’s roadmap since the years, and would be started once the foundations were ready. Therefore with OpenCL 2.0, SYCL was born.

To keep the pace high, a SYCL workshop is being organised. This week the call-for-papers is opened, which you can read below.

1st SYCL workshop (SYCL’16) – co-located with PPoPP’16

Barcelona, Spain Sunday, 13th March, 2016

SYCL (sɪkəl – as in sickle) is a royalty-free, cross-platform C++ abstraction
layer that builds on the underlying concepts, portability and efficiency of
OpenCL, while adding the ease-of-use and flexibility of C++. For example, SYCL
enables single source development where C++ template functions can contain both
host and device code to construct complex algorithms that use OpenCL
acceleration, and then re-use them throughout their source code on different
types of data. SYCL has also been designed with resilience from the start, by
featuring, for example, a fall-back mechanism to automatically re-enqueue
kernels on different queues in case of a failure.

The SYCL Workshop aims to gather together SYCL’s users, researchers, educators
and implementors to encourage and grow a community of users behind the SYCL
standard, and related work in C++ for heterogeneous architectures. This will be
a half-day workshop. SYCL’16 will be held in Barcelona, 13 March 2016,
co-located with PPoPP 2016, HPCA 2016, CGO 2016 and LLVM 2016.

Travel Awards

Student authors who present papers in this workshop are eligible to apply for
travel awards. Further details will be announced after notification of
acceptance.

Important Dates

Submissions: 23rd November
Notification: 21st December
Final version: 24th January, 2016
Workshop: Sunday, 13th March, 2016

Submission Guidelines

All submissions must be made electronically through the conference submission
site, at https://easychair.org/conferences/?conf=sycl16.
Submissions may be one of the following:

Extended abstract: Two pages in standard SIGPLAN two-column conference
format (preprint mode, with page numbers)

Short Paper: Four to six pages in standard SIGPLAN two-column conference
format (preprint mode, with page numbers)

Submissions must be in PDF format and printable on US Letter and A4 sized
paper. All submissions will be peer-reviewed by at least two members of the
program committee. We will aim to give longer presentation slots to papers than
to extended abstracts. Conference papers will not be published, but made
available through the website, alongside the slides used for each presentation.
The aim is to enable authors to get feedback and ideas that can later go into
other publications. We will encourage questions and discussions during the
workshop, to create an open environment for the community to engage with.

Topics of interest include, but are not limited to:

Applications implemented using SYCL

C++ Libraries using SYCL

C++ programming models for OpenCL (C++AMP, Boost.Compute, …)

Other C++ applications using OpenCL

New proposals to the SYCL specification

Integration of SYCL with other programming models

Compilation techniques to optimise SYCL kernels

Performance comparisons between SYCL and other programming models

Implementation of SYCL on novel architectures (FPGA, DSP, …)

Using SYCL in fault-tolerant systems

Reports on SYCL implementations

Debuggers, profilers and tools

Organising Committee

Paul Keir, University of the West of Scotland (UK)
Ruyman Reyes, Codeplay Software Ltd, Edinburgh (UK)

Program Committee

Jens Breitbart, TU Munich
Alastair Donaldson, Imperial College London, UK
Christophe Dubach, University of Edinburgh, UK
Joel Falcou, LRI, Université Paris-Sud, France
Benedict Gaster, University of the West of England, UK
Vincent Hindriksen, StreamHPC, Netherlands
Christopher Jefferson, St. Andrews University, UK
Ronan Keryell, Xilinx, Ireland
Zoltán Porkoláb, ELTE, Hungary
Francisco de Sande, Universidad de La Laguna, Spain
Ana Lucia Varbanescu, University of Amsterdam, Netherlands
Josef Weidendorfer, TU Munich

Yes, we’re in the Program Committee as one of the few non-academics. We’re looking forward to read your proposal!

If you have a blog, feel free to copy the above text and repost it.

We’re a member of Khronos now!

Posted by Vincent Hindriksen on 29 September 2015 with 2 Comments

For years we have had a good collaboration with the Khronos group, mainly due our community presence. Now it was time to get into a closer collaboration and become an official Contributor Member (logo not there yet). This effectively means two things:

Instead of complaining on the blog and on twitter, we can now discuss it within the working group. 🙂
If we accidentally find interesting info we now know is under NDA, we won’t share with you anymore. 🙁

Our goal for collaborating with Khronos remains the same: help OpenCL and its community advance. We therefore keep building OpenCL.org, writing articles on OpenCL and organising events in the years to come.

One of our goals of the coming year is to get more vendors on OpenCL 2.0. If you think we should have more goals on our agenda, write them in the comments.

StreamHPC communications

NVidia

Tesla K80 (Kepler)

Pascal P100 (Pascal)

Titan Black (Kepler) and GTX 980 (Maxwell)

Tegra X1

AMD

FirePro S9170 (GCN)

Radeon Nano and FirePro S9300X2 (Fiji)

AMD Carrizo A10 8890k APU (HSA)

Intel

Xeon E5-2699 v4

XeonPhi Knights landing

Xeon+FPGA

Iris GPUs

Selecting the right hardware

Very close to the processor

HBM versus GDDR5

Better GB/s per Watt

A lot smaller

Standards AMD has pioneered

Generation

The numbers

Costs

Which to choose?

Khronos released OpenCL 2.1 final specs

Khronos released SPIR-V 1.0 final specs

Intel claims OpenMP is up to 10x faster than OpenCL

AMD claims OpenCL is hardly used in HPC

CUDA goes AMD and open source

Khronos gets a more open homepage

More news?

Khronos

OpenCL on FPGAs

Embedded OpenCL

HPC OpenCL

Others

Send your photos and news!

Hidden bugs

How we did it

The bonus

Our Unique Selling Points

Memory sizes

CL_KERNEL_LOCAL_MEM_SIZE

CL_KERNEL_PRIVATE_MEM_SIZE

Work sizes

CL_KERNEL_GLOBAL_WORK_SIZE

CL_KERNEL_WORK_GROUP_SIZE

CL_KERNEL_COMPILE_WORK_GROUP_SIZE

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

Read more?

1st SYCL workshop (SYCL’16) – co-located with PPoPP’16

Travel Awards

Important Dates

Submission Guidelines

Topics of interest include, but are not limited to:

Organising Committee

Program Committee