AccelerEyes ArrayFire

There is a lot going on along the path to GPGPU 2.0 – the libraries built on top of OpenCL and/or CUDA. Among the many solutions we see, for example, Microsoft with C++ AMP on top of DirectCompute, NVidia (and others) with OpenACC, and now AccelerEyes (best known for their Matlab extension Jacket and for libJacket) with ArrayFire.

I want to show you how easy programming GPUs can be when using such libraries. Know that for using all features, such as complex numbers, multi-GPU support and linear algebra functions, you need to buy the full version; prices start at $2500 for a workstation/server with 2 GPUs.

It comes in two flavours: one for OpenCL (C++) and one for CUDA (C, C++, Fortran). The code for both is the same, so you can easily switch – although you still see references to cuda.h, you can compile most examples from the CUDA version with the OpenCL version after little editing. Let’s look a little into what it can do.
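To give an impression of the programming style, here is a minimal sketch of ArrayFire host code. It is written against the current ArrayFire C++ API (af::setDevice, af::randu, af::matmul and the af_print macro), so names may differ slightly per version:

#include <arrayfire.h>

int main() {
    af::setDevice(0);                       // select the first GPU
    af::array a = af::randu(1024, 1024);    // random matrix, created on the device
    af::array b = af::matmul(a, a);         // matrix multiply runs as GPU kernels
    af_print(b(af::seq(3), af::seq(3)));    // inspect the top-left 3x3 corner
    return 0;
}

Note that there is not a single explicit kernel, buffer or queue in sight – that is the whole point of these libraries.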

Continue reading “AccelerEyes ArrayFire”

Basic Concepts: online kernel compiling

Typos are a programmer’s worst nightmare, as they are bad for concentration: the code in your head is not the same as the code on the screen, and fixing the difference has little to do with the actual problem-solving. Code highlighting in the IDE helps, but it is better to use the actual OpenCL compiler without running your whole software: an online OpenCL compiler. In short, it is just an OpenCL program that takes a variable kernel as input, and thus uses the compilers of Intel, AMD, NVidia or whatever you have installed to try to compile the source (see the sketch after the list below). I have found two solutions, both of which have to be built from source – so a C compiler is needed.

  • CLCC. It needs the boost libraries, cmake and make to build. Works on Windows, OSX and Linux (possibly needs some fixes, see below).
  • OnlineCLC. Needs waf to build. Seems to be Linux-only.
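
The core of such a tool is just a handful of OpenCL API calls. Below is a minimal sketch of my own (it simply picks the first platform and device it finds; error checks are trimmed for brevity), to show there is no magic involved:

#include <CL/cl.h>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: clcheck <kernel.cl>" << std::endl; return 1; }

    // read the kernel source from the file given on the command line
    std::ifstream file(argv[1]);
    std::stringstream ss;
    ss << file.rdbuf();
    std::string src = ss.str();
    const char* src_ptr = src.c_str();

    // take the first platform and device available
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, nullptr);
    cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);

    // hand the source to the vendor's compiler; print the build log on failure
    cl_program program = clCreateProgramWithSource(context, 1, &src_ptr, nullptr, nullptr);
    if (clBuildProgram(program, 1, &device, "", nullptr, nullptr) != CL_SUCCESS) {
        size_t log_size = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, nullptr, &log_size);
        std::vector<char> log(log_size);
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log.data(), nullptr);
        std::cerr << log.data() << std::endl;
        return 1;
    }
    std::cout << "Kernel compiled fine" << std::endl;
    return 0;
}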

Continue reading “Basic Concepts: online kernel compiling”

AMD gDEBugger 6.2 for Linux

The printf function in kernels isn’t the solution to everything; hence the existence of profilers and debuggers specially tailored to GPU programming. On Windows there is a lot of choice, but mostly only if you have a paid version of Visual Studio. On Linux you have GDB, but that program is not really friendly for the GUI-lovers.
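
For reference, this is what kernel-side printf looks like (a built-in since OpenCL 1.2; before that, vendor extensions such as cl_amd_printf offered the same). Handy for a quick peek, but no replacement for a real debugger:

__kernel void debug_me(__global const float* in, __global float* out) {
    int i = get_global_id(0);
    out[i] = in[i] * 2.0f;
    if (i == 0)                          // print from one work-item only,
        printf("in[0] = %f\n", in[0]);   // to avoid a flood of output
}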

For AMD, gDEBugger is available for Linux again. Version 5.8 by Gremedy still worked on Linux, but after AMD bought the company, version 6 became Windows-only. A few weeks ago, 10 months after 6.0, Linux binaries came back with version 6.2. It supports OpenCL 1.2, OpenGL 3.2 and quite some extensions. As only AMD is supported, there will be more later on debugging OpenCL applications on NVidia and Intel.

Installation is quite straightforward. For creating a menu-item, you’ll find a useful image in /opt/gDEBugger6.2.xxx/tutorial/images/.

Continue reading “AMD gDEBugger 6.2 for Linux”

Get ready for conversions of large-scale CUDA software to AMD hardware

In the past years we have been translating several types of software to AMD, targeting OpenCL (and HSA). The main problem was that manual porting limits the size of the to-be-ported code-base.

Luckily there is a new tool in town. AMD now offers HIP, which converts over 95% of CUDA code such that it works on both AMD and NVIDIA hardware. The remaining 5% consists of solving the ambiguity problems that arise when CUDA is used on non-NVIDIA GPUs. Once the CUDA code has been translated successfully, the software runs on both NVIDIA and AMD hardware without problems.
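
To give an idea of what converted code looks like, below is a minimal sketch in HIP (API names as in AMD’s public HIP documentation; the exact launch macro has changed between HIP releases, so take this as illustrative rather than definitive):

#include <hip/hip_runtime.h>
#include <cstdio>

// kernel syntax is identical to CUDA
__global__ void add(float* a, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += b[i];
}

int main() {
    const int n = 1024;
    float *a, *b;
    hipMalloc(&a, n * sizeof(float));    // direct counterpart of cudaMalloc
    hipMalloc(&b, n * sizeof(float));
    hipMemset(a, 0, n * sizeof(float));
    hipMemset(b, 0, n * sizeof(float));
    // counterpart of CUDA's add<<<grid, block>>>(a, b, n) launch syntax
    hipLaunchKernelGGL(add, dim3(n / 256), dim3(256), 0, 0, a, b, n);
    hipDeviceSynchronize();
    std::printf("done\n");
    hipFree(a);
    hipFree(b);
    return 0;
}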

The target group of HIP is companies with older clusters, who don’t want to pay premium prices for NVIDIA’s latest offerings. Replacing a single server with 4 Tesla K20 GPUs of 3.5 TFLOPS by one with 3 dual-GPU FirePro S9300X2s of 11 TFLOPS will give a huge performance boost for a competitive price.

The cost of making CUDA work on AMD hardware is easily covered by that price difference when upgrading a GPU cluster.

Continue reading “Get ready for conversions of large-scale CUDA software to AMD hardware”

Join us at the Dutch eScience Symposium 2019 in Amsterdam

Soon there will be another Dutch eScience Symposium in Amsterdam. We thought it might be a good place to meet and listen to e-science talks. Stream HPC, in the end, is in the business of making scientific software, so it’s the right place for us. The eScience Center is a government institute that aims to advance eScience in the Netherlands.

Interested? Read on!

Continue reading “Join us at the Dutch eScience Symposium 2019 in Amsterdam”

How many threads can run on a GPU?

Blocks of threads

Q: Say a GPU has 1000 cores, how many threads can efficiently run on a GPU?

A: At a minimum around 4 billion can be scheduled; tens of thousands can run simultaneously.

If you are used to working with CPUs, you might have expected 1000. Or 2000, with hyper-threading. Handling so many more threads than the number of available cores might sound inefficient, but there are a few reasons why a GPU has been designed to handle so many threads. Read further…
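
To put rough numbers on it (made-up but typical figures, not tied to a specific GPU): a GPU keeps tens of threads in flight per core, so that whenever one group of threads stalls on a memory access, another is ready to run – and a single launch can queue vastly more threads than are ever resident at once.

#include <cstdio>

int main() {
    const long long cores     = 1000;         // the "cores" from the question
    const long long in_flight = 40000;        // threads resident on the GPU at once
    const long long scheduled = 4294967296LL; // ~4 billion schedulable in one launch

    // dozens of threads per core to switch between is what hides memory latency
    std::printf("threads in flight per core: %lld\n", in_flight / cores);
    // the scheduler feeds in new groups as old ones finish
    std::printf("launch size vs in-flight: %lldx\n", scheduled / in_flight);
    return 0;
}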

NOTE: The description below is a (very) simplified model, with the purpose of explaining the basics. It is far from complete, as explaining it all would take a full book-chapter. Continue reading “How many threads can run on a GPU?”

AMD OpenCL coding competition

The AMD OpenCL coding competition seems to be Windows 7 64-bit only. So if you are on another version of Windows, on OSX or (like me) on Linux, you are left out. Of course StreamHPC supports software that just works anywhere (seriously, how hard is that nowadays?), so here are the instructions for entering the competition when you work with Eclipse CDT. Why it only works on 64-bit Windows I don’t really get (but I understood it was a hint).

I focused on Linux, so it might not work with Windows XP or OSX right away. With a little hacking, I’m sure you can adapt the instructions to work with e.g. Xcode or any other IDE that can import C++ projects with makefiles. Let me know if it works for you and what you changed.

Continue reading “AMD OpenCL coding competition”

Privacy Policy

Who we are

We are a group of companies based in the Netherlands, Hungary and Spain. We help our customers get their code to run fast, by optimizing the computations and using accelerators. We have been doing this since 2010.

Comments

When visitors leave comments on the site we collect the data shown in the comments form, and also the visitor’s IP address and browser user agent string to help spam detection.

An anonymised string created from your email address (also called a hash) may be provided to the Gravatar service to see if you are using it. The Gravatar service Privacy Policy is available here: https://automattic.com/privacy/. After approval of your comment, your profile picture is visible to the public in the context of your comment.

Forms

Form data is sent to self-hosted software and is not read by any third party.

Tracking

We use anonymized tracking to find out:

  • Which pages are visited how often
  • Which subjects are popular
  • Which pages are clicked through
  • From which countries or states the visitors are

During a visit/session, you get a random ID.

Cookies

If you leave a comment on our site you may opt in to saving your name, email address and website in cookies. These are for your convenience so that you do not have to fill in your details again when you leave another comment. These cookies will last for one year.

Tracking cookies last for 24 hours.

Embedded content from other websites

Articles on this site may include embedded content (e.g. videos, images, articles, etc.). Embedded content from other websites behaves in the exact same way as if the visitor has visited the other website.

These websites may collect data about you, use cookies, embed additional third-party tracking, and monitor your interaction with that embedded content, including tracking your interaction with the embedded content if you have an account and are logged in to that website.

Who we share your data with

None of the data is shared with any third party. Marketing reports don’t contain any personal data.

How long we retain your data

If you leave a comment, the comment and its metadata are retained indefinitely. This is so we can recognize and approve any follow-up comments automatically instead of holding them in a moderation queue.

Anonymous tracking data is not thrown away, so we can spot trends over the years.

What rights you have over your data

If you have left comments, you can request to receive an exported file of the personal data we hold about you, including any data you have provided to us. You can also request that we erase any personal data we hold about you. This does not include any data we are obliged to keep for administrative, legal, or security purposes.

Where your data is sent

Visitor comments and form submissions are checked through automated spam detection services: ReCAPTCHA and Akismet.

Reporting problems

We are not in the business of monetizing user data, and believe in finding new customers through content.

As software and plugins change after updates, we are sometimes surprised that more is collected than we configured.

If anything is incorrect or not legal, please email to privacy@streamhpc.com. If you have generic questions, go to the contact page or email to info@streamhpc.com.

Big Data

Big data is a term for data so large or complex that traditional processing applications are inadequate. Challenges include:

  • capture, data-curation & data-management,
  • analysis, search & querying,
  • sharing, storage & transfer,
  • visualization, and
  • information privacy.

The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.

At StreamHPC we’re focused on optimizing (predictive) analytics and data-handling software, as these tend to be slow. We solve Big Data problems in two areas: real-time pre-processing (filtering, structuring, etc.) and analytics (including in-memory search on a GPU).

FortranCL working example

The ’96 book is still available here, and has some good explanations of numerical mathematics. Oh, the good old times…

Last week I needed to get Fortran working with OpenCL. As the example page is not up-to-date and there is not much documentation on the interwebs outside the official page, this was not as straightforward as I had hoped. The test-suite and this article provided code I could actually use. First I wanted to have things in a module, second I needed to control which device I wanted to use, and third I needed function names that could be used in a larger project. The result is below, and hopefully usable for the Fortran folks around who want to add some OpenCL kernels to their existing code.

It uses the two-step initialisation we know from C, for safe memory allocation: first query how many platforms/devices there are, then allocate the list and query again to fill it. It is based on the utils.f90 from the test-suite.

The only good way to translate is the Rose compiler – which is a pain to install. I tried various f2c scripts (from the 90’s), but they all failed. I must say that continuously switching between Fortran-mode and C-mode was the hardest part of the porting.

If you have tips & tricks for using OpenCL from Fortran, let everybody know in the comments. Also let me know if the code doesn’t work for you, or if you have improvements (like better error-handling).

The rest of utils.f90 (which I renamed to clutils.f90 for better integration) is mostly the same – only this subroutine needed changes:

(...)

subroutine cl_initialize(platform_id, device_id, device, context, command_queue)
!use ISO_C_BINDING
type(cl_device_id),     intent(out)     :: device
type(cl_context),       intent(out)     :: context
type(cl_command_queue), intent(out)     :: command_queue
integer                                 :: platform_id
integer                                 :: device_id

integer :: platform_count, device_count, ierr
character(len = 100) :: info
type(cl_platform_id) :: platform
type(cl_platform_id), allocatable, target :: platform_ids(:)
type(cl_device_id), allocatable, target :: device_ids(:)

! get the platform ID
call clGetPlatformIDs(platform_count, ierr)
if(ierr /= CL_SUCCESS) call error_exit('Cannot get CL platform.')
allocate(platform_ids(platform_count))
call clGetPlatformIDs(platform_ids, platform_count, ierr)
if(ierr /= CL_SUCCESS) call error_exit('Cannot get CL platform.')

! fall back to the first platform when the requested one is out of range
if (platform_id .gt. platform_count .or. platform_id .lt. 1) platform_id = 1
platform = platform_ids(platform_id)

! get the device ID
call clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, device_count, ierr)
if(ierr /= CL_SUCCESS) call error_exit('Cannot get CL device.')
allocate(device_ids(device_count))
call clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, device_ids, device_count, ierr)
if(ierr /= CL_SUCCESS) call error_exit('Cannot get CL device.')

if (device_id .gt. device_count .or. device_id .lt. 1) device_id = 1
device = device_ids(device_id)

! get the device name and print it
call clGetDeviceInfo(device, CL_DEVICE_NAME, info, ierr)
print*, "CL device: ", info

! create the context and the command queue
context = clCreateContext(platform, device, ierr)
command_queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, ierr)

end subroutine cl_initialize

(...)

Continue reading “FortranCL working example”

We sponsor HiPEAC again this year

HiPEAC is an academically oriented, 3-day, international conference around HPC, compilers and processors. Last year it was in Vienna, this year it is in Amsterdam – where StreamHPC is also based. That was an extra reason to go for silver sponsorship, besides the fact that I find this conference very important.

Compilers have the job of doing magic. Last year I got nice feedback on my suggestion that compilers should tell developers where in the code they struggle – effectively slapping the guy/gal instead of trying to solve it with more magic. I also learned a lot about compilers in general, listened to GPGPU talks, discussed HPC, and most of all: met a lot of very interesting people.

Why should you come too? I’ll give you five reasons:

  • Learn about compilers and GPU-techniques, in depth.
  • Have great discussions about the latest and greatest research, before it’s news.
  • Meet great people who create the compilers you use (or the reverse).
  • Visit Amsterdam, Netherlands – I can be your guide. Flights are cheap.
  • Only spend €400 for the full 3 day programme and a unique dinner with 500 people – compare that to SC14 and GTC!

If you are looking for a job in HPC, compilers or GPGPU, you should really come over. We’re there, and several other sponsors are looking for new employees too.

See the tracks at HiPEAC, which has a lot more GPU-oriented talks than last year. I selected a few from the list in bold.

Monday

  • Opening address
  • William J. Dally, Challenges for Future Computing Systems
  • Euro-TM: Final Workshop of the Euro-TM COST Action
  • Session 1. Processor Core Design
  • CS²: Cryptography and Security in Computing Systems
  • IMPACT: Polyhedral Compilation Techniques
  • MCS: Integration of mixed-criticality subsystems on multi-core and manycore processors
  • EEHCO: Energy Efficiency with Heterogeneous Computing
  • INA-OCMC: Interconnection Network Architecture: On-Chip, Multi-Chip
  • WAPCO: Approximate Computing
  • SoftErr: Mitigation of soft errors: from adding selective redundancy to changing the abstraction stack
  • Session 2. Data Parallelism, GPUs
  • James Larus, It’s the End of the World as We Know It (And I Feel Fine)
  • ENTRE: EXCESS & NANOSTREAMS
  • SiPhotonics: Exploiting Silicon Photonics for energy-efficient high-performance computing
  • HetComp: Heterogeneous Computing: Models, Methods, Tools, and Applications
  • Session 3. Caching
  • Session 4. I/O, SSDs, Flash Memory
  • Student poster session / Welcome reception

Tuesday

Don’t forget to meet us at the industrial poster-sessions.

  • Rudy Lauwereins, New memory technologies and their impact on computer architectures
  • Thank you HiPEAC
  • Session 5. Emerging Memory Technologies
  • EMC²: Mixed Criticality Applications and Implementation Approaches
  • ADEPT: Energy Efficiency in High-Performance and Embedded Computing
  • MULTIPROG: Programmability Issues for Heterogeneous Multicores
  • WRC: Reconfigurable Computing
  • TISU: Transfer to Industry and Start-ups
  • HiStencils: High-Performance Stencil Computations
  • MILS: Architecture and Assurance for Secure Systems
  • Programmability: Programming Models for Large Scale Heterogeneous Systems
  • Industrial Poster Session
  • INNO2015: Innovation actions in Advanced Computing CFP
  • Session 6. Energy, Power, Performance
  • DCE: Dynamic Compilation Everywhere
  • EUROSERVER: Green Computing Node for European Micro-servers
  • PolyComp: Polyhedral Compilation without Polyhedra
  • HiPPES4CogApp: High-Performance Predictable Embedded Systems for Cognitive Applications
  • Industrial Session
  • Session 7. Memory Optimization
  • Session 8. Speculation and Transactional Execution
  • Canal tour / Museum visit / Banquet

Wednesday

  • Burton J. Smith, Resource Management in PACORA
  • HiPEAC 2016
  • Session 9. Resource Management and Interconnects
  • PARMA-DITAM: Parallel Programming and Run-Time Management Techniques for Many-core Architectures + Design Tools and Architectures for Multi Core Embedded Computing Platforms
  • ADAPT: Adaptive Self-tuning Computing System
  • PEGPUM: Power-Efficient GPU and Many-core Computing
  • HiRES: High-performance and Real-time Embedded Systems
  • RAPIDO: Rapid Simulation and Performance Evaluation: Methods and Tools
  • MemTDAC: Memristor Technology, Design, Automation and Computing
  • DataFlow, Computing in Space: DataFlow SuperComputing
  • IDEA: Investigating Data Flow modeling for Embedded computing Architectures
  • TACLe: Timing Analysis on Code-Level
  • EU Projects Poster Session
  • Session 10. Compilers
  • HIP3ES: High Performance Energy Efficient Embedded Systems
  • HPES: High Performance Embedded Systems
  • Session 11. Concurrency
  • Session 12. Methods (Simulation and Modeling)

Hope to see you there!

Porting CUDA to OpenCL

OpenCL speed-meter in 1970 Plymouth Cuda car

Why port your CUDA-accelerated software to OpenCL? Simply to make your software also run on AMD CPUs/APUs/GPUs, Intel CPUs/GPUs, Altera FPGAs, Xilinx FPGAs, Imagination PowerVR, ARM MALI, Qualcomm Snapdragon and upcoming architectures.

And as OpenCL is an open standard supported by many vendors, there is much more certainty that it will keep existing in the future than with any proprietary language.

If you look at the history of GPU programming you’ll find many frameworks, such as BrookGPU, Close-to-Metal, Brook+, RapidMind, CUDA and OpenCL. CUDA was the best choice from 2008 to 2013, as OpenCL had to catch up. Now that OpenCL is gaining serious market traction, the demand for porting legacy CUDA code to OpenCL rises – as we clearly notice here.
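
Much of such a port is mechanical. As a generic sketch (not taken from an actual project): a CUDA kernel and its OpenCL counterpart differ mainly in qualifiers and index built-ins, while the surrounding host code needs the bigger rework.

// CUDA:  __global__ void scale(float* v, float f, int n) {
//            int i = blockIdx.x * blockDim.x + threadIdx.x;
//            if (i < n) v[i] *= f;
//        }

// OpenCL: __global__ becomes __kernel, buffers get an address-space
// qualifier, and the thread index comes from get_global_id
__kernel void scale(__global float* v, float f, int n) {
    int i = get_global_id(0);
    if (i < n) v[i] *= f;
}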

We are very experienced in porting legacy CUDA code to all flavours of OpenCL (CPU, GPU, FPGA, embedded). Of course porting from OpenCL to CUDA is also possible, as is updating legacy CUDA code to the standards of CUDA 7.0 and later. We can also add several improvements to the architecture; we have made many customers happy by giving them more structured and documented code while working on the port. Want to see some work we did? We ported the molecular dynamics software Gromacs from CUDA to OpenCL.

Contact us today to discuss your project in more detail. We have porting services for every budget.

Request a pilot, code-review or more information: https://streamhpc.com/consultancy/request-more-information/

Selecting Applications Suitable for Porting to the GPU

Assessing software is never comparing apples to apples

The goal of this article is to explain which applications are suitable to be ported to OpenCL and run on a GPU (or multiple GPUs). It does so by showing the main differences between GPUs and CPUs, and by listing features and characteristics of problems and algorithms that can make use of the highly parallel architecture of a GPU and thus simply run faster on graphics cards. Additionally, there is a list of issues that can decrease the potential speed-up.

It does not try to be complete, but focuses on the most essential parts of assessing whether code is a good candidate for porting to the GPU.

GPU vs CPU

The biggest difference between a GPU and a CPU is how they process tasks, due to their different purposes. A CPU has a few (usually 4 or 8, but up to 32) ”fat” cores optimized for sequential serial processing, like running an operating system, Microsoft Word or a web browser, while a GPU has thousands of ”thin” cores designed to be very efficient when running hundreds of thousands of alike tasks simultaneously.

A CPU is very good at multi-tasking, whereas a GPU is very good at repetitive tasks. GPUs offer much more raw computational power than CPUs, but they would completely fail at running an operating system. Compare it to 4 motorcycles (CPU) versus 1 truck (GPU) delivering goods: when the goods have to be delivered to customers throughout the city, the motorcycles win; when all goods have to be delivered to a few supermarkets, the truck wins.

Most problems need both processors to deliver the best system performance, price and power. The GPU does the heavy lifting (the truck brings goods to distribution centers) and the CPU does the flexible part of the job (the motorcycles do the last-mile deliveries).
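
In code terms, the “truck work” is the loop where every iteration is independent of the others. A generic sketch: the kernel below is the GPU version of such a loop, with one work-item per element – exactly what thousands of ”thin” cores are built for.

// each work-item reads and writes only its own element, with no
// dependency on other iterations - ideal for a GPU
__kernel void saxpy(__global float* y, __global const float* x, float a) {
    int i = get_global_id(0);    // this work-item's element
    y[i] = a * x[i] + y[i];
}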

Assessing software for GPU-porting fitness

Software that does not meet its performance requirement (time taken versus time available) is always a potential candidate for being ported to a GPU. Continue reading “Selecting Applications Suitable for Porting to the GPU”

Aparapi: OpenCL in Java

Edit: Aparapi has been open-sourced, and many issues have already been fixed and improved.

If you have an AMD GPU/APU, you should try Aparapi. This software lets you write OpenCL code in Java at a pretty high level. The idea is that it processes the Java intermediate code to search for loops, and then creates optimised OpenCL kernels. Just download Aparapi and try the two examples. As the current version is still in alpha, it is not flawless yet. What I think is important when working with Aparapi is that you learn to keep it simple – just as you gain the most speed on straight roads, while turns slow you down.

The Aparapi team tries to avoid explicit definition of local memory, but it is still possible by using the @Local annotation. Such decisions show that the team wants Aparapi to be high-level. It also integrates well with JavaCL and JOCL, so you can mix in the kernels you have already created. You can also check out a video introducing Aparapi (it is video 15, if #-linking doesn’t work).

Time to create your own project. As not all errors are documented or solved in the upcoming version, below you will find a list of common errors and how to easily solve them.

Continue reading “Aparapi: OpenCL in Java”

Caffe and Torch7 ported to AMD GPUs, MXnet WIP

Last week AMD released ports of Caffe, Torch and (work-in-progress) MXnet, so these frameworks now work on AMD GPUs. With the Radeon MI6, MI8 and MI25 (25 TFLOPS half precision) to be released soonish, it is of course simply needed to have software running on these high-end GPUs.

The ports were announced in December. You can see that the MI25 is about 1.45x faster than the Titan XP. With the release of the three frameworks, current GPUs can now be benchmarked and compared.

The expected good performance/price ratio will make this very interesting, especially for large installations. Another slide discussed which frameworks will be ported: Caffe, TensorFlow, Torch7, MxNet, CNTK, Chainer and Theano.

This leaves HIP ports of TensorFlow, CNTK, Chainer and Theano still to be released. Continue reading “Caffe and Torch7 ported to AMD GPUs, MXnet WIP”

Power to the Vector Processor

Reducing energy-consumption is “hot”

I just read the article “Nvidia is losing on the HPC front” by The Inquirer, which mixes up the demand for low-power architectures with the other side of the market: the demand for high performance. It made me think that it is not at all clear that these are two separate markets using the same technology. Nvidia has also proven it not to be true, since the supercomputer “Nebulae” uses almost half the watts per flop of the #1. How come? I quote The Register, from an article one year old:

>>When you do the math, as far as Linpack is concerned, Jaguar takes just under 4 watts to deliver a megaflops at a cost of $114 per megaflops for the iron, while Nebulae consumes 2 watts per megaflops at a cost of $39 per megaflops for the system. And there is little doubt that the CUDA parallel computing environment is only going to get better over time and hence more of the theoretical performance of the GPU ends up doing real work. (Nvidia is not there yet. There is still too much overhead on the CPUs as they get hammered fielding memory requests for GPUs on some workloads.)<<

Nvidia is, and should be, very proud. But I’m already looking forward to when hybrids get more common. They will really shake up the HPC market (as The Register agrees), by lowering the latency between GPU and CPU and lowering energy consumption. But where we can find a bigger market is mobile.

Continue reading “Power to the Vector Processor”

The current state of WebCL

Years ago Microsoft was in court, as it claimed Internet Explorer could not be removed from Windows without breaking the system, while competitors claimed it could. Why was this so important? Because (as it turned out) the browser would become more important than the OS, and the internet as important as electricity in the office and at home. I was therefore very happy to see the introduction of WebGL, which brings OpenGL to the browser, as this would push web-interfaces as the default for user-interfaces. WebCL is a browser-plugin to run OpenCL kernels, meaning that more powerful hardware becomes available to JavaScript. This post is work-in-progress, as I try to find more resources! Seen stuff like this? Let me know.

Continue reading “The current state of WebCL”

AMD now leads the Green500

With SC14 behind us, there are a few things I’d like to share with you. I’d like to start with the biggest win for OpenCL: AMD leading with the most power-efficient GPU-cluster.

A few months ago I wrote a theoretical article on how to build the cheapest and greenest supercomputer that would enter the Top500 and Green500. There I showed that AMD would theoretically win on both GFLOPS/costs and GFLOPS/Watt. Last week I learned a large cluster has actually been built in Germany, and it now leads the Green500 (GFLOPS/Watt). It is powered by Intel Ivy Bridge CPUs and an FDR Infiniband network, and accelerated by air-cooled(!) AMD FirePro S9150 GPUs, as can be seen in the Green500 report of November. The score: 5.27 GFLOPS per Watt, mostly thanks to AMD’s surprise act: extremely efficient SGEMM and DGEMM.


The first NVIDIA Tesla-based system on the list is at #3, with 4.45 GFLOPS per Watt for a liquid-cooled system. If the AMD FirePro S9150 were oil- or water-cooled, the system could go to over 6 GFLOPS per Watt. I’m expecting such a system on the Green500 of June. The PEZY-SC (#2 on the list) is a very interesting, unexpected newcomer to the field – I’ll share more with you later, as I heard it supports OpenCL.

The price metric

The cluster at the GSI Helmholtz Center has around 1.65 double-precision PetaFLOPS (theoretical). Let’s do the same calculation as with the 150 TFLOPS system, using the latest prices and taking only the accelerator part.

640 x AMD FirePro S9150.

  • 2.53 TFLOPS * 640 = 1.62 PFLOPS (I rounded down to 2.0 TFLOPS in the other article)
  • US$ 3300. Total price: $2.112M. Price per PFLOPS: $1.304M
  • 235 Watt * 640 = 150 kWatt (excluding network, CPUs, etc)

640 x NVIDIA Tesla K40x

  • 1.42 TFLOPS * 640 = 0.91 PFLOPS
  • US$ 3160 (came down a lot due to the introduction of the K80!). Total price: $2.022M. Price per PFLOPS: $2.225M
  • 235 Watt * 640 = 150 kWatt

640 x Intel XeonPhi 7120P

  • 1.21 TFLOPS * 640 = 0.77 PFLOPS
  • US$ 3450. Total price: $2.208M. Price per PFLOPS: $2.852M
  • 300 Watt * 640 = 192 kWatt

So it’s pretty clear why GSI chose AMD: $0.92M (versus NVIDIA) or $1.55M (versus Intel) less cost per PetaFLOPS, for the same compute. Also note that more GFLOPS per accelerator is important to lower overhead.

What to expect from June’s Green500

Next year Nvidia probably comes with Maxwell, which will probably do very well in the Green500. Intel has its new XeonPhi, but it’s a very new architecture and no samples have arrived yet – I would be surprised if it delivers, as they have over-promised for too long now. Besides bringing surprises, Intel’s other strengths are its vast collaborations and strong fanbase – in the past years I have heard the most ridiculous reasons why such an underperforming accelerator was chosen instead of a FirePro or Tesla – so it’s certainly aiming for a rampage (based on hope). AMD did not disclose any information on a successor to the S9150 (something like an S9200 or S9250).

Then there are the dual GPUs, which have no advantage except lower energy usage. The K80 just arrived, but the numbers don’t add up yet – we’ll have to see when the samples arrive. AMD did not say anything about the next version of the S10000, which will probably arrive next year – no ETA. Intel has not done dual-chip cards until now. Systems with these can be built more compactly, as 4 GPUs per system is becoming a standard.

Another important change will be CPUs with embedded GPUs being used in clusters, where now mostly Intel Xeons rule the world. Intel’s Iris Pro line and AMD’s new Carrizo APU could certainly get more popular, as more complex code can be accelerated very well by such processors. We’ll also see more 64-bit ARM processors – hopefully with a GPU. This subject I’ll handle in a separate article, as OpenCL could be a big enabler for easy offloading.

Based on the information currently available, Nvidia aims for Maxwell-based Teslas, AMD for the S9150 and its dual-GPU variant, and Intel for none (aiming for November 2015). It’ll be exciting to see HPC reach 6+ GFLOPS/Watt as a standard – I find that more important than building the biggest cluster.

OpenCL will help you select hardware from that year’s winner, without being locked in to that year’s loser. Meanwhile, at StreamHPC we will keep building OpenCL-based software, to help our customers pick that winner.

Tutorials

During our courses/trainings we will teach you the best of what you can find here.

We try to keep the following information as complete as possible, so please contact us if something is missing.

Learning OpenCL


OpenCL Optimisation guides

Not available (yet):

  • Imagination PowerVR
  • Qualcomm Adreno
  • Xilinx FPGAs


University courses

OpenCL-based GPU-programming courses

Architectures

Videos


Cases/Studies


WebCL

WebCL is a standard-to-be for OpenCL in the browser. Currently there are a few implementations, while Khronos is working on an official standard. WebCL is available for Firefox on 32-bit Linux and 32/64-bit Windows (by Nokia), and for Safari on OSX (by Samsung). A Node.js implementation has been made by Motorola. Examples written for one implementation will probably not work on another.

Tutorials:

Check Khronos’ WebCL page for more resources.

C/C++

Basic knowledge of C is needed to understand how to write kernels. Also many tutorials are in C++.


Basic OpenGL

Getting a grasp of OpenGL has advantages. Techniques for faster memory operations in OpenGL have equivalents in OpenCL, which gives good reason to read up on this subject.


Mega-kernel versus Micro-kernels in LuxRender (repost)

LuxRenderer demo-rendering

Below is a (slightly edited) repost of a blog by

I find micro-kernels an important subject, as they have clear advantages. In OpenCL 2.0 there are more possibilities to create smaller kernels. Making smaller and more focused functions is also considered good software engineering, known as “Separation of Concerns“.

For a general introduction to the concept of “Mega Vs Micro” kernels, read “Megakernels Considered Harmful: Wavefront Path Tracing on GPUs” by Samuli Laine, Tero Karras, and Timo Aila of NVIDIA. Abstract:

When programming for GPUs, simply porting a large CPU program into an equally large GPU kernel is generally not a good approach. Due to SIMT execution model on GPUs, divergence in control flow carries substantial performance penalties, as does high register usage that lessens the latency-hiding capability that is essential for the high-latency, high-bandwidth memory system of a GPU. In this paper, we implement a path tracer on a GPU using a wavefront formulation, avoiding these pitfalls that can be especially prominent when using materials that are expensive to evaluate. We compare our performance against the traditional megakernel approach, and demonstrate that the wavefront formulation is much better suited for real-world use cases where multiple complex materials are present in the scene.

The OpenCL kernels in “SmallLuxGPU” (a raytracer, originally made by David) have followed the micro-kernel approach from the very beginning. However, with the merge with LuxRender and the introduction of LuxRender materials, textures, light sources, etc., one of the kernels grew to the point of being a “mega-kernel”.

The major problem with the mega-kernel, aside from the inability of the AMD OpenCL compiler to compile it, is the huge register usage and the very low GPU utilization. Why this happens is well explained in the paper.
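
As an illustration of the difference (a generic sketch, not LuxRender’s actual kernels): a mega-kernel keeps the registers of all stages alive at once, while micro-kernels run the stages as separate, smaller launches.

// mega-kernel: both stages live in one kernel, so the register
// pressure of "trace" and "shade" adds up and occupancy drops
__kernel void mega(__global const float4* rays, __global float4* colors, int n) {
    int i = get_global_id(0);
    if (i >= n) return;
    float4 hit = rays[i] * 0.5f;               // stand-in for the trace stage
    colors[i] = hit * hit + (float4)(0.1f);    // stand-in for the shade stage
}

// micro-kernels: each stage is a separate launch with its own,
// much smaller register footprint
__kernel void trace_stage(__global const float4* rays, __global float4* hits, int n) {
    int i = get_global_id(0);
    if (i < n) hits[i] = rays[i] * 0.5f;
}

__kernel void shade_stage(__global const float4* hits, __global float4* colors, int n) {
    int i = get_global_id(0);
    if (i < n) colors[i] = hits[i] * hits[i] + (float4)(0.1f);
}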

PATHOCL Micro-kernels edition, the results

The number of kernels increases from 2 to 10, the register usage decreases from 196 (!!!) to 3–84, and the GPU utilization rises from a miserable 10% to a healthier 30%–100%.

Occupancy increases from 10% to 30% or more

The performance increase is huge on some platforms (Linux + FirePro W8100): 3.6 times.

Speed increases from 0.84M to 3.07M samples/sec

A speedup in the 20% to 40% range has been reported on MacOS/Windows + NVIDIA GPUs.

It solves the problems with the AMD compiler

Micro-kernels not only improve the performance, but also address the major issues with the AMD OpenCL compiler. For the very first time since the release of the first AMD OpenCL SDK beta, I’m not aware of any scene not running on AMD GPUs. This is SATtva’s Mic scene, running on GPUs for the first time:

Scene builds correctly on AMD hardware for the first time

Try it out yourself

This feature will be extended to BIASPATHOCL and will be available in LuxRender v1.5.

A new version of PATHOCL is available in this branch. The sources of micro-kernels are available here.

To run with micro-kernels, use “path.microkernels.enable=1”.

Windows on ARM

In 2010 Microsoft got interested in ARM because of low-power solutions for server-parks. ARM had lobbied for years to convince Microsoft to port Windows to their architecture, and now the result is there. Let’s not look at the past, at why they did not do it earlier and depended completely on Intel, AMD/ATI and NVIDIA. NB: This article is a personal opinion, meant to open up the conversation about Windows plus OpenCL.

While Google and Apple have taken their share of the ARM-OS market, Microsoft wants some too. A wise choice, but again late. We’ve seen how the Windows PC market was targeted first from the cloud (run services in the browser on any platform) and by Apple’s user-friendly eye-candy (a personal computer had to be distinguished from a dull working-machine), then from smartphones and tablets (many users want e-mail and a browser, not to sit behind their desk). MS’s responses were Azure (cloud, Q1 2010), Windows 7 (OS with a slick user-interface, Q3 2009), Windows Phone 7 (smartphones, Q4 2010) and now Windows 8 (OS for X86 PCs and ARM tablets, 2012 or later).

Windows 8 for ARM will be made with assistance from NVIDIA, Qualcomm and Texas Instruments, according to their press release [1]. They even demonstrated a beta of Windows 8 running Microsoft Office on ARM hardware, so it is not just a promise.

How can Microsoft support this new platform, and (more important for StreamHPC) what will the consequences be for OpenCL, CUDA and DirectCompute?

Continue reading “Windows on ARM”