OpenCL support on recent Android smartphones

There is more than one way (image by Pank Seelen)

The embedded world is extremely flexible because it is full of open standards. We therefore expect that the big processor vendors will push harder than Google can push back. OpenCL support is very important for GPGPU libraries like ArrayFire, VexCL and ViennaCL – with it, these can be ported to Android in far less time.

Apple has now introduced Metal on iOS, increasing the fragmentation even more. StreamHPC and friends are working hard on getting one language onto all platforms, so we can focus on bringing solutions to you. Understand that if OpenCL gets popular on Android, the chance increases that it will also get accepted on other mobile platforms like iOS and Windows Mobile/Phone.

On the other hand, it is being blocked wherever possible, as GPGPU brings unique apps. A RenderScript-only or Metal-only app is good for the sales of one type of smartphone – good for them, bad for developers who want to target the whole market.

Getting the current status

To get more insight into the current situation, Pavan Yalamanchili of ArrayFire has created a spreadsheet (click here to edit it yourself). It is publicly editable, so anybody can help complete it. Be clear about the version of Android you are running, as for instance in 4.4.4 there are possibly some blocks thrown up by Google. If you found drivers but did not get OpenCL running, please put that in the notes. You can easily find out if your smartphone supports OpenCL using this OpenCL-Info app. Thanks in advance for helping out!
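
If you prefer to probe for a driver programmatically instead of installing an app, below is a minimal native sketch. The library paths are common vendor locations we have seen, not guaranteed for every device or Android version:

// Minimal sketch: probe for an OpenCL driver on Android via dlopen.
// The paths below are common vendor locations, not guaranteed per device.
#include <dlfcn.h>
#include <cstdio>

int main() {
    const char *paths[] = {
        "libOpenCL.so",
        "/system/vendor/lib/libOpenCL.so",
        "/system/lib/libOpenCL.so",
    };
    for (const char *path : paths) {
        if (void *handle = dlopen(path, RTLD_LAZY)) {
            // The symbol every OpenCL driver must export.
            bool usable = dlsym(handle, "clGetPlatformIDs") != nullptr;
            std::printf("%s: %s\n", path,
                        usable ? "usable OpenCL driver" : "no entry points");
            dlclose(handle);
            return usable ? 0 : 1;
        }
    }
    std::printf("No OpenCL driver found.\n");
    return 1;
}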

Why not just RenderScript?

We think that RenderScript could be built on top of OpenCL. This would allow new programming languages and help find the optimal programming solution faster than just trusting Google's engineers – solving this problem is not about being smart, but about being open to more routes.

The same goes for Metal, which even tries to replace both OpenCL and OpenGL. Again, it is a higher-level language which can be expressed in OpenGL and OpenCL.

Let’s see if Apple and Google serve their dedicated developers, or if we-the-developers must serve them. Let’s hope for the best.

Why use OpenCL on FPGAs?

Altera has just released the free ebook FPGAs for Dummies. One part of the book is devoted to OpenCL, so we’ll quote some extracts from one of the chapters here. The rest of the book is worth a read, so if you want to check the rest of the text, just fill in the form on Altera’s webpage.

At StreamHPC we’re interested in OpenCL on FPGAs for one reason: many companies run their software on GPUs when they should be using FPGAs instead, while at the same time others stick to FPGAs and ignore GPUs completely. The main reason, we think, is that converting CUDA to VHDL, or Verilog to CPU intrinsics, is simply too painful. Another reason is the amount of investment already put into a certain technology. We believe that OpenCL can solve both of these issues. OpenCL is much more portable and can be converted to a new architecture in a relatively short time (if the developer is familiar with the project, the hardware and OpenCL). We are highly familiar with the latter two, which means we’re used to getting new projects up and running.

Since both Altera and Xilinx have invested in OpenCL, FPGA code has become more portable. Altera has a public SDK (and they’re proudly loud about it), while Xilinx offers it in their latest tools (although they’re unfortunately much more silent about it).

Now, let us go back to the quotes from the book that we wanted to share with you.

Andrew Moore describes OpenCL effectively in just a few sentences:

The need for heterogeneous computing is leading to new programming languages to exploit the new hardware. One example is the OpenCL first developed by Apple, Inc. OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, DSPs, FPGAs, and other types of processors. OpenCL includes a language for developing kernels (functions that execute on hardware devices) as well as application programming interfaces (APIs) that define and control the various platforms. OpenCL allows for parallel computing using task-based and data-based parallelism.

The author also shares some interesting insights into why OpenCL should be used on FPGAs:

FPGAs are inherently parallel, so they’re a perfect fit with OpenCL’s parallel computing capabilities. FPGAs give you an alternative to the typical data or task parallelism by offering a pipeline parallelism where tasks can be spawned in a push-pull configuration with each task using different data from the previous task with or without host interaction. OpenCL allows you to develop your code in the familiar C programming language but using the additional capabilities provided by OpenCL. These kernels can be sent to the FPGAs without your having to learn the low-level HDL coding practices of FPGA designers. Generally, there are several benefits for software developers and system designers to use OpenCL to develop code for FPGAs:

  • Simplicity and ease of development: Most software developers are familiar with the C programming language, but not low-level HDL languages. OpenCL keeps you at a higher level of programming, making your system open to more software developers.
  • Code profiling: Using OpenCL, you can profile your code and determine the performance-sensitive pieces that could be hardware accelerated as kernels in an FPGA.
  • Performance: Performance per watt is the ultimate goal of system design. Using an FPGA, you’re balancing high performance in an energy-efficient solution.
  • Efficiency: The FPGA has a fine-grain parallelism architecture, and by using OpenCL you can generate only the logic you need to deliver one fifth of the power of the hardware alternatives.
  • Heterogeneous systems: With OpenCL, you can develop kernels that target FPGAs, CPUs, GPUs, and DSPs seamlessly to give you a truly heterogeneous system design.
  • Code reuse: The holy grail of software development is achieving code reuse. Code reuse is often an elusive goal for software developers and system designers. OpenCL kernels allow for portable code that you can target for different families and generations of FPGAs from one project to the next, extending the life of your code.

Today, OpenCL is developed and maintained by the technology consortium Khronos Group. Most FPGA manufacturers provide Software Development Kits (SDKs) for OpenCL development on FPGAs.
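
To make the book’s pipeline-parallelism point concrete, here is a minimal sketch of our own (not from the book). Where on a GPU you would launch thousands of parallel work-items, FPGA compilers typically favour a single-work-item kernel whose loop they turn into a deep hardware pipeline:

// Single-work-item OpenCL kernel - the style FPGA compilers pipeline well.
// The compiler turns the loop into a hardware pipeline, with a new
// iteration entering every clock cycle (vendor pragmas can tune this).
__kernel void scale(__global const float *in,
                    __global float *out,
                    const float factor,
                    const int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * factor;
}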

You can continue here if you want to read more of this ebook. And of course, whenever you want to learn some more, feel free to write to us, or follow the conversation on Twitter via our special account: @OpenCLonFPGAs.

Supported university projects

At StreamHPC we support the following projects:

“Graphitti”

Massively Parallel Large-Scale Graph Processing on Modern Architectures.

VENI 12480, by Ana Lucia Varbanescu, from Delft University of Technology, Netherlands.

Vincent Hindriksen is on the User Committee.

“Programabilidad y Eficiencia Energética de Sistemas Heterogéneos y de Altas Prestaciones”

Programmability and Efficiency of Heterogeneous and High Performance Systems

By Dr. Francisco Almeida Rodríguez and Alejandro Acosta, from University of La Laguna, Spain.

Other collaborations

We’re open to collaborations when we see true value in your project. We bring industry experience, help with performance engineering (GPGPU, OpenCL, embedded, FPGA, etc.) and exposure to the market after finalisation.


AMD now leads the Green500

With SC14 behind us, there are a few things I’d like to share with you. I’d like to start with the biggest win for OpenCL: AMD powering the most power-efficient GPU cluster.

A few months ago I wrote a theoretical article on how to build the cheapest and greenest supercomputer that could enter the Top500 and Green500. There I showed that AMD would theoretically win on both GFLOPS/costs and GFLOPS/Watt. Last week I learned that a large cluster has actually been built in Germany, and it now leads the Green500 (GFLOPS/Watt). It is powered by Intel Ivy Bridge CPUs and an FDR Infiniband network, and accelerated by air-cooled(!) AMD FirePro S9150 GPUs, as can be seen in the November Green500 report. The score: 5.27 GFLOPS per Watt, mostly thanks to AMD’s surprise act: extremely efficient SGEMM and DGEMM.


The first NVIDIA Tesla-based system on the list is at #3, with 4.45 GFLOPS per Watt for a liquid-cooled system. If the AMD FirePro S9150 were oil- or water-cooled, the system could go to over 6 GFLOPS per Watt. I’m expecting such a system on the Green500 of June. The PEZY-SC (#2 on the list) is a very interesting, unexpected newcomer to the field – I’ll share more with you later, as I heard it supports OpenCL.

The price metric

The cluster at GSI Helmholtz Center has around 1.65 double precision PetaFLOPS (theoretical). Let’s do the same calculation as with the 150 TFLOPS system, using the latest prices and only taking the accelerator part.

640 x AMD FirePro S9150.

  • 2.53 TFLOPS × 640 = 1.62 PFLOPS (I rounded down to 2.0 TFLOPS in the other article)
  • US$ 3300. Total price: $2.112M. Price per PFLOPS: $1.304M
  • 235 Watt * 640 = 150 kWatt (excluding network, CPU, etc)

640 x NVIDIA Tesla K40

  • 1.42 TFLOPS × 640 = 0.91 PFLOPS
  • US$ 3160 (went down a lot due to the introduction of the K80!). Total price: $2.022M. Price per PFLOPS: $2.225M
  • 235 Watt * 640 = 150 kWatt

640 x Intel XeonPhi 7120P

  • 1.21 TFLOPS × 640 = 0.77 PFLOPS
  • US$ 3450. Total price: $2.208M. Price per PFLOPS: $2.852M
  • 300 Watt * 640 = 192 kWatt
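
If you want to verify the numbers or plug in current prices, here is a small sketch that recomputes the metric (same inputs as the lists above):

// Quick sanity check of the price metric above.
#include <cstdio>

int main() {
    const int n = 640;
    struct { const char *name; double tflops, price; } card[] = {
        {"AMD FirePro S9150",   2.53, 3300},
        {"NVIDIA Tesla K40",    1.42, 3160},
        {"Intel XeonPhi 7120P", 1.21, 3450},
    };
    for (const auto &c : card) {
        double pflops = c.tflops * n / 1000.0;  // total DP PFLOPS
        double totalM = c.price * n / 1e6;      // total price in M$
        std::printf("%-20s %.2f PFLOPS, $%.3fM, $%.3fM per PFLOPS\n",
                    c.name, pflops, totalM, totalM / pflops);
    }
    return 0;
}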

So it’s pretty clear why GSI chose AMD: roughly $0.9M (versus Nvidia) to $1.5M (versus Intel) less cost per PFLOPS. Also note that more FLOPS per accelerator is important to lower overhead.

What to expect from June’s Green500

Next year Nvidia will probably come with Maxwell, which will probably do very well in the Green500. Intel has their new XeonPhi, but it’s a very new architecture and no samples have arrived yet – I would be surprised if it shows up soon, as they have over-promised for too long now. Besides bringing surprises, Intel’s other strengths are its vast collaborations and strong fanbase – over the past years I have heard the most ridiculous reasons why such an underperforming accelerator was chosen instead of a FirePro or Tesla, so Intel is certainly aiming for a rampage (based on hope). AMD did not disclose any information on a successor of the S9150 (something like an S9200 or S9250).

Then there are the dual GPUs, whose main advantages are density and lower energy usage. The K80 has just arrived, but the numbers don’t add up yet – we’ll have to see when samples arrive. AMD did not say anything about the next version of the S10000, but it probably arrives next year – no ETA. Intel has not done dual-chip cards until now. Such systems can be built more compactly, as 4 GPUs per system is becoming a standard.

Another important change will be CPUs with embedded GPUs being used in clusters, where now mostly Intel Xeons rule the world. Intel’s Iris Pro line and AMD’s new Carrizo APU could certainly get more popular, as more complex code can be accelerated very well by such processors. We’ll also see more 64-bit ARM processors – hopefully with a GPU. I’ll handle this subject in a separate article, as OpenCL could be a big enabler for easy offloading.

Based on the information currently available to me, Nvidia aims for Maxwell-based Teslas, AMD for the S9150 and its dual-GPU variant, and Intel for none (aiming for November 2015). It’ll be exciting to see HPC get to 6+ GFLOPS/Watt as a standard – I find that more important than building the biggest cluster.

OpenCL lets you select hardware from that year’s winner, instead of being locked in to that year’s loser. Meanwhile at StreamHPC we will keep building OpenCL-based software, to help our customers pick that winner.

How to introduce HPC in your enterprise

Spare time in IT – © jaymz.eu

For the past ten years we have been happy to get back home from the office: our home computer is simply faster, has more software, has more memory and does not take over 10 minutes to boot. Office computers can be that slow because 90% of the work is typing documents anyway. Meanwhile, the office servers are mostly used for the intranet and backups only. It’s the way of life, and it seems we have to accept it.

But what if you have a daily batch that takes 1 hour to run and 10 people need to wait for the results to continue their tasks? What if you simply need a bigger server to serve your colleagues faster? Then Office-HPC can be the answer: the type of High Performance Computing that is affordable and within reach for most companies with more than 50 employees.

Below you’ll find out what you should do, in a nutshell.

Phase 0: Get familiar with parallel and GPU-computing, and convince your boss

This will take only one or two weeks, as it’s more about understanding the basics.

Understand what it’s all about and what’s important. We offer trainings, but you can also look around in the “knowledge base” in the menu above for lots of free advice. This is very important and should be done before anything else. Even if you end up with CUDA, learn the basics of OpenCL first. Why? Because after CUDA there is only one answer: using Nvidia hardware. Please delay this decision until later, so you don’t end up with the wrong solution.

How do you get your boss to invest in all this? I won’t lie about it: it’s a big investment. Luckily the return on investment is very good, even when only 10 people in the company use the software. If the waiting period per person is reduced by 20 minutes per day, it’s easy to see that it pays back quickly: that’s 80 hours per person per year. For 10 people that is already €20K per year. StreamHPC has sped up software to take hours less time to process the daily data – therefore many of our clients could easily earn back the investment within a year.

Phase 1: Know what device you want to use

Quite often I get customers who have bought an expensive Tesla, FirePro or XeonPhi and then ask me to speed up their software. Often I get the question “how do I speed up this algorithm on this device?”, while the question should be “how do I speed up this algorithm?”. It takes some time to find out which device fits the algorithm best.

There is too much to discuss in this phase, so I’ll keep it to a short Q&A. Please ask us for advice, as this phase is very important! We would rather help people for free than read about failed “HPC in the office” projects (which give others the idea that the technology is not ready yet).

Q: What programming language do I use?

Let’s start with the short answer. Is everything to be used within your office only, forever? Then use any language you want: CUDA, OpenCL or one of the many others. If you want the software to run on more devices, use OpenCL or OpenGL shaders. For example, when developing with several partners, you cannot stick to CUDA and should use OpenCL – else you force others to make certain investments. But if you have some domain-specific compute engine where you will only share the API in the cloud, you can use CUDA without problems.

Part of the long answer is that it is entangled with the algorithm you want to use. Please take good care of this, and base your decision on good research – not on what people have told you without discussing your code first.

Q: FPGAs? Why would I use those?

True, they’re more expensive, but they use much less power (20-30 Watt TDP). They’re famous for low-latency computations. If you already have OpenCL software, it ports quite easily to an FPGA – therefore I like the combination of AMD FirePro (good OpenCL support) and Altera Stratix V.

Xilinx recently also started to support OpenCL on their devices. They have the same reason as Altera: to shorten the development time for FPGA code.

Q: Why do CPUs still exist?

Because they perform pretty well on very irregular algorithms. The latest Xeon CPUs with 16 cores outperform GPUs when branch prediction is used heavily. And by using OpenCL you can get more performance than with OpenMP, plus you can port between devices much more easily.

Q: I heard I should not use gaming GPUs. Why not?

A: Professional accelerators come with support and tuned libraries, which explains part of the higher price. So even if gaming GPUs suffice, you’ll need the support before you get to a cluster – free support is mostly community-based and only answers the problems everybody has. Libraries are also often better tuned for professional cards. See it like this: gaming GPUs come with free games, professional compute GPUs come with free support and libraries.

Q: I can’t have passively cooled server-GPUs in my desktop. What now?

  • Intel: go for the XeonPhis whose model number ends with an “A” (= actively cooled).
  • NVIDIA: for the newly announced K80 there will not be an actively cooled version – so take the actively cooled K40.
  • AMD: instead of the S9150, get a W9100.
  • Altera: FPGAs are low-power, so you can use the same device. Do ask your supplier specifically if this applies to the FPGA you have in mind.

Phase 2: Have your office computer upgraded

As the goal is to see performance as in a cluster, it’s better to have at least two accelerators in your computer. This is a big investment, but it’s also a good one: it’s the first step towards getting HPC into your office, so better do it well. If you want to use all the GPUs’ memory, make sure your host has at least as much RAM as your accelerators combined: the S9150 has 16GB of memory, so you need 32GB of host RAM to support two cards.

If you make use of an external software development company, you also need a good machine to test the software and to understand the code that will be rolled out in your company. Control and understanding of the code are very important when working with consultants!

In case you did not get through phase 1 completely, it’s better to test with one accelerator first. Unless you need something like OpenGL/OpenCL interaction, make sure you use a third GPU for the video output, as display usage can influence GPU performance.

Program your software using MPI to connect the two accelerators, and be in full control of what is blocking, so you are prepared for the cluster.
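
A minimal sketch of that structure, assuming one MPI rank per accelerator with the device picked by rank (the actual kernels and error handling are omitted):

// Same code runs on a two-GPU desktop and, later, on the cluster.
#include <mpi.h>
#include <CL/cl.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    cl_platform_id platform;
    cl_device_id devices[8];
    cl_uint ndev = 0;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, devices, &ndev);
    int idx = (ndev > 0) ? rank % (int)ndev : 0;  // assumes at least one GPU
    cl_device_id mydev = devices[idx];

    // ... create a context and queue on mydev, run kernels, then exchange
    // results with blocking MPI_Send/MPI_Recv so every synchronisation
    // point is explicit ...

    std::printf("rank %d of %d uses GPU %d of %u\n", rank, size, idx, ndev);
    MPI_Finalize();
    return 0;
}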

Phase 3: Roll software out in a small group

At this phase it’s time to offer the service to a selected group. Say you have chosen to offer your compute solution via an Excel plugin, which communicates with the software via an API. Add new users one at a time – and make sure (parts of) the results are tested! From here it’s software development as we know it, and the most unexpected bugs will come out of the test group.

If you get good results, your colleagues will have some accelerators by now too. If you did phases 0 and 1 well, you will probably get good results anyway. The moment you have set up the MPI environment on multiple desktops, you have also set up your minimal test environment. This is very important for later, as many enterprises lack a proper test environment – in that case it’s better to have it partially shared with your development environment. I’m pretty sure I’ll get comments on this, but I would really like more companies to do larger-scale tests before the production step.

Phase 4: Get a cluster (or cloud service)

If your algorithm is not CPU-bound, it’s best to have as many GPUs per CPU as possible; else you need to keep it to one or two. We can already give you advice on this in phase 1, so you know what to prepare for. Then the most important step comes: calculate how much hardware you need to support the needs of your enterprise. It is possible that you only need one node with 8 GPUs to support even thousands of users.

Say the algorithm is not CPU-bound; then it’s best to put as many GPUs per node as possible. Personally I like ASUS servers most, as they are very open to all accelerators, unlike others who only offer accelerators from “selected partners”. At SC14 they introduced the ESC8000 E3, which holds 8 accelerators via PCIe 3.0 x16 buses. There are more options available, but those systems don’t mention support for all vendors – my experience is that you get worse support if you do something special.

For Altera-only nodes you should look at completely different server cases, as the cooling requirements are different. For Xeon-only nodes you can find solutions with 4 CPU sockets.

If you are allowed to transport company data outside the local network and can handle the data transfers over the internet, then a cloud-based service might also be an option. Feel free to ask us what the options are nowadays.

You’re done

If the users are happy, then probably more software needs to be ported to the accelerators now. So good luck and have fun!

We sponsor HiPEAC again this year

HiPEAC is an academic-oriented, 3-day, international conference around HPC, compilers and processors. Last year it was in Vienna, this year it is in Amsterdam – where StreamHPC is also based. That was an extra reason to go for a silver sponsorship, besides the fact that I find this conference very important.

Compilers have the job of doing magic. Last year I got nice feedback on my request to give developers feedback on where in the code the compiler struggles – effectively slapping the developer instead of trying to solve it with more magic. I also learned a lot about compilers in general, listened to GPGPU talks, discussed HPC, and most of all: met a lot of very interesting people.

Why should you come too? I’ll give you five reasons:

  • Learn about compilers and GPU-techniques, in depth.
  • Have great discussions about the latest and greatest research, before it’s news.
  • Meet great people who create the compilers you use (or the reverse).
  • Visit Amsterdam, Netherlands – I can be your guide. Flights are cheap.
  • Only spend €400 for the full 3 day programme and a unique dinner with 500 people – compare that to SC14 and GTC!

If you are seeking a job in HPC, compilers or GPGPU, you should really come over. We’re there, and several other sponsors are looking for new employees too.

See the tracks at HiPEAC below, which have a lot more GPU-oriented talks than last year.

Monday

  • Opening address
  • William J. Dally, Challenges for Future Computing Systems
  • Euro-TM: Final Workshop of the Euro-TM COST Action
  • Session 1. Processor Core Design
  • CS²: Cryptography and Security in Computing Systems
  • IMPACT: Polyhedral Compilation Techniques
  • MCS: Integration of mixed-criticality subsystems on multi-core and manycore processors
  • EEHCO: Energy Efficiency with Heterogeneous Computing
  • INA-OCMC: Interconnection Network Architecture: On-Chip, Multi-Chip
  • WAPCO: Approximate Computing
  • SoftErr: Mitigation of soft errors: from adding selective redundancy to changing the abstraction stack
  • Session 2. Data Parallelism, GPUs
  • James Larus, It’s the End of the World as We Know It (And I Feel Fine)
  • ENTRE: EXCESS & NANOSTREAMS
  • SiPhotonics: Exploiting Silicon Photonics for energy-efficient high-performance computing
  • HetComp: Heterogeneous Computing: Models, Methods, Tools, and Applications
  • Session 3. Caching
  • Session 4. I/O, SSDs, Flash Memory
  • Student poster session / Welcome reception

Tuesday

Don’t forget to meet us at the industrial poster-sessions.

  • Rudy Lauwereins, New memory technologies and their impact on computer architectures
  • Thank you HiPEAC
  • Session 5. Emerging Memory Technologies
  • EMC²: Mixed Criticality Applications and Implementation Approaches
  • ADEPT: Energy Efficiency in High-Performance and Embedded Computing
  • MULTIPROG: Programmability Issues for Heterogeneous Multicores
  • WRC: Reconfigurable Computing
  • TISU: Transfer to Industry and Start-ups
  • HiStencils: High-Performance Stencil Computations
  • MILS: Architecture and Assurance for Secure Systems
  • Programmability: Programming Models for Large Scale Heterogeneous Systems
  • Industrial Poster Session
  • INNO2015: Innovation actions in Advanced Computing CFP
  • Session 6. Energy, Power, Performance
  • DCE: Dynamic Compilation Everywhere
  • EUROSERVER: Green Computing Node for European Micro-servers
  • PolyComp: Polyhedral Compilation without Polyhedra
  • HiPPES4CogApp: High-Performance Predictable Embedded Systems for Cognitive Applications
  • Industrial Session
  • Session 7. Memory Optimization
  • Session 8. Speculation and Transactional Execution
  • Canal tour / Museum visit / Banquet

Wednesday

  • Burton J. Smith, Resource Management in PACORA
  • HiPEAC 2016
  • Session 9. Resource Management and Interconnects
  • PARMA-DITAM: Parallel Programming and Run-Time Management Techniques for Many-core Architectures + Design Tools and Architectures for Multi Core Embedded Computing Platforms
  • ADAPT: Adaptive Self-tuning Computing System
  • PEGPUM: Power-Efficient GPU and Many-core Computing
  • HiRES: High-performance and Real-time Embedded Systems
  • RAPIDO: Rapid Simulation and Performance Evaluation: Methods and Tools
  • MemTDAC: Memristor Technology, Design, Automation and Computing
  • DataFlow, Computing in Space: DataFlow SuperComputing
  • IDEA: Investigating Data Flow modeling for Embedded computing Architectures
  • TACLe: Timing Analysis on Code-Level
  • EU Projects Poster Session
  • Session 10. Compilers
  • HIP3ES: High Performance Energy Efficient Embedded Systems
  • HPES: High Performance Embedded Systems
  • Session 11. Concurrency
  • Session 12. Methods (Simulation and Modeling)

Hopefully see you there!

8 reasons why SPIR-V makes a big difference

From all the news that came out of GDC, I’m most eager to talk about SPIR-V. This intermediate language will make a big difference for the compute industry. In this article I’d like to explain why. If you need a technical explanation of what SPIR-V is, I suggest you first read gtruc’s article on SPIR-V and then return here for an overview of the advantages.

Currently there are several shader and compute languages, which SPIR-V tries to replace/support: GLSL and HLSL for graphics shaders; SPIR (without the V), OpenCL, CUDA and many others for compute.

If you have questions after reading this article, feel free to ask them in a comment or contact us directly.

OpenCL

StreamHPC is best known for its OpenCL services, including development. We have hit record speed-ups of over 250,000× when optimising code with OpenCL. During those projects several techniques have been used to reach these high numbers – even well-designed projects we can often still speed up 2 to 8 times.

Advantages

OpenCL works on more types of hardware than any other language. Compare it to C and C++, which are used to program all kinds of software on very different hardware.

Basic OpenCL is used to make portable software that performs well. Advanced OpenCL is used to squeeze maximum performance out of specific accelerators.

Hardware

Projects can be targeting one or more operating systems, focusing on one or several processors:

  • CPUs, by Intel, AMD and/or ARM
  • NVidia GPUs
  • AMD GPUs
  • Embedded GPUs
    • Vivante
    • ARM MALI
    • Imagination
    • Qualcomm
  • Altera FPGAs
  • Xilinx FPGAs
  • Several special focus processors, mostly implementing a subset of OpenCL.

We use modern coding techniques to write code with maximum flexibility and maximum performance.

How OpenCL works

OpenCL is an extension to existing languages. It makes it possible to specify a piece of code that is executed multiple times, independently of the others. This code can run on various processors – not only the main one. There is also an extension for vectors (float2, short4, int8, long16, etc.), because modern processors support those.

Say, for example, you need to calculate sin(x) of a large array of one million numbers. OpenCL detects which devices can compute this for you and gives some statistics of each device. You pick the best device, or even several devices, and send the data to the device(s). Normally you would loop over the million numbers, but now you say something like: “Get me sin(x) of each x in array A”. When it’s finished, you take the data back from the device(s) and you are done.
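
In code, that example looks roughly like this – a minimal sketch with all error handling omitted:

// The sin(x) example as a minimal OpenCL 1.x program.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

static const char *src =
    "__kernel void sin_all(__global const float *in, __global float *out) {\n"
    "    size_t i = get_global_id(0);\n"
    "    out[i] = sin(in[i]);\n"
    "}\n";

int main() {
    const size_t N = 1000000;
    std::vector<float> a(N), r(N);
    for (size_t i = 0; i < N; ++i) a[i] = 0.001f * i;

    cl_platform_id platform; cl_device_id dev;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &dev, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, nullptr);

    // Send the data to the device...
    cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                N * sizeof(float), a.data(), nullptr);
    cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, N * sizeof(float),
                                nullptr, nullptr);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "sin_all", nullptr);
    clSetKernelArg(k, 0, sizeof(cl_mem), &in);
    clSetKernelArg(k, 1, sizeof(cl_mem), &out);

    // ..."get me sin(x) of each x in array A"...
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &N, nullptr, 0, nullptr, nullptr);

    // ...and take the results back when finished.
    clEnqueueReadBuffer(q, out, CL_TRUE, 0, N * sizeof(float), r.data(),
                        0, nullptr, nullptr);
    std::printf("sin(%f) = %f\n", a[1], r[1]);
    return 0;
}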

As compute devices can do more in parallel and OpenCL is better at describing independent functions, the total execution time is much lower than with conventional methods.

4 common questions on OpenCL

Q: Why is it so fast?
A: Because many hands make light work – the hundreds of little processors on a graphics card being the extra hands. Cooperation with the main processor remains important, though, to achieve maximum output.

Q: Does it work on any type of hardware?
A: As it is an open standard, it can work on any type of hardware that targets parallel execution. This can be a CPU, GPU, DSP or FPGA.

Q: How does it compare to OpenMP/MPI?
A: Where OpenMP and MPI try to split loops over threads/servers and are CPU-oriented, OpenCL focuses on making threads data-position aware and exploiting processor capabilities. There are several efforts to combine the two worlds.

Q: Does it replace C or C++?
A: No, it is an extension which integrates well with C, C++, Python, Java and more.

The knowns and unknowns of the PEZY-SC accelerator at RIKEN

The Green500 is out and an unknown processor takes the number-one position with a huge improvement over last year. It is a new supercomputer installed at RIKEN with an incredible 7 GFLOPS/Watt. It is powered by the processor boards at the right: two Xeons, 4 PEZY-SC 1.4 accelerators and 128GB DRAM, with a combined performance of about 6.2 TFLOPS. It has been designed for immersive cooling.

The second and third positions are also powered by the PEZY-SC, before we find the winner of last year: the AMD FirePro S9150, and a bit after that the rest (mostly NVidia Tesla). One constant is the CPUs used: Intel Xeon takes most spots. To my big surprise, no ARM64.

[Top 5 of the June 2015 Green500]

From the third to the first PEZY-SC installation there is an improvement of 13%. It seems the first two are of the new type, called “bricks”, while the third is the same as last year. Compared with that machine from last year (4.4945 GFLOPS/W), the improvements are 42% and 25%. The 13% improvement over the previous version is interesting enough, but the 25% improvement on exactly the same system raised questions. Probably it is due to compiler optimisations. As the November edition of the Green500 is much stricter, it will become clear if the rules were bent – let’s hope it’s for real!

It supports OpenCL!

When new accelerators support OpenCL, they get accepted more easily. So it is very interesting that the PEZY-SC runs OpenCL. I asked at ISC and was told it is a subset of OpenCL, but I could not put my finger on which subset, nor could I get access to test it. It does mean that code that would run well on this machine is easy to port. And by that I mean the same “easy” Intel uses when explaining the ease of porting OpenMP software to the XeonPhi: PEZY-specific optimisations and writing around the missing functionality would still take effort – the typical stuff we do at StreamHPC.

RIKEN Shoubu

Some information on “Shoubu” (“Iris” in Japanese), the number one on the Green500. According to the Green500 it does 353.8 TFLOPS (based on 50kW, using an actual benchmark). On 25 June RIKEN announced that Shoubu is 2 PFLOPS (theoretical). If the full machine was used for the Green500 run, then the efficiency was only 18%!

Below are some images of the installation.

[Photos of the Shoubu installation]

Source: http://www.exascaler.co.jp/wp-content/uploads/2015/06/20150625.pdf

An important part is ExaScaler’s immersion technology, which as I understand is a spin-off of PEZY. I’m very curious what the AMD FirePro S9150 would do with immersion cooling – I think we have to do some frying at the office to find out.

PEZY-SC1.4 and PEZY-SC2

PEZY started with a multi-core processor of 512 cores, the PEZY-1. The PEZY-SC has 1,024 cores and has had a few gradual upgrades – currently the PEZY-SC 1.4 (“the brick”) is installed.

PEZY-SC Specification:

  • Logic cores (PE): 1,024
  • Core frequency: 733 MHz
  • Peak performance (floating point): 3.0 TFLOPS single / 1.5 TFLOPS double
  • Host interface: PCI Express Gen3 x8 lane x 4 port (x16 bifurcation available), JESD204B protocol support
  • DRAM interface: DDR4/DDR3 combo, 64-bit x 8 ports, max B/W 1,533.6 GB/s, plus Ultra-Wide IO SDRAM (2,048-bit) x 2 ports, max B/W 102.4 GB/s
  • Control CPU: ARM926, dual core
  • Process node: 28 nm
  • Package: FCBGA 47.5 mm x 47.5 mm, ball pitch 1 mm, 2,112 pins

Source: http://pezy.co.jp/en/products/pezy-sc.html

Development of the PEZY-SC2 is ongoing; it will have a staggering 4,096 cores. Of course efficiency has to go up (if the 18% is correct) to make this a good upgrade.

There is no promise on when the PEZY-SC2 will be announced, but it will certainly surprise us again when it arrives.

“Soon we will use only one thousandth of available computer capacity”

Professor Henri Bal, who tries to wake up the Netherlands to start going big on parallel programming

At StreamHPC we mostly work for companies in the bigger countries of Europe and North America; we hardly work for companies in the Netherlands. But it seems that after 5 years of sleeping, something is stirring. Below is a (translated) article with the above quote by Prof. Dr. Ir. Henri Bal, professor in the Computer Systems section at the Vrije Universiteit Amsterdam.

Lack of knowledge of parallel programming will cause a situation where only one thousandth of the capacity of computers will be used. This makes computations unnecessarily slow and inaccurate. That in turn will slow down the development of the Dutch knowledge economy.

Sequential programming, instructing computers to perform calculations in a queue, is now the standard. Computer processors, however, are much more sophisticated and able to perform thousands or even millions of computations simultaneously. But the programming of such many-cores “is still in its infancy, so industries that rely heavily on data cannot perform optimally”, claims Bal.

The value of parallel programming, according to Bal, is of enormous importance in, for example, meteorology and forensics. “For weather forecasting, data from the dense measuring network needs to be processed quickly and accurately, to have a weather forecast for tomorrow, not after 48 hours,” he says. “In forensics, all data should be explored as soon as possible in the first 24 hours after a crime, using pattern recognition, so that no trace is lost. The video material of 80,000 security cameras was searched through manually after the attack on the London Underground in 2005 – with parallel computing methods this can now be done rapidly by the computer.”

If the Netherlands wants to catch up, investments are necessary, says Bal. The focus should be on research and teaching. “Investments in research on programming new massively-parallel machines are required to gain knowledge. It must be examined how programs should be written for parallel computing methods, and to what extent parallel calculations can be performed automatically. In teaching, our future programmers also need to be prepared for the new standards of parallel programming. Only then can the Netherlands make optimal use of the available computer capacity.”

I think my fellow countrymen will be surprised to find help just around the corner. And if they wait two more years, then 1000× speed-ups over sequential programs will indeed become possible.

Have you seen similar articles claiming that sequential programming is slowing down the knowledge economy?

How to do Approximation Computing in OpenCL

Most processors have been on the plane of low error.

Approximation computing means allowing larger errors when performing calculations. While most programmers might go the other way (lower error rates), for instance by using doubles, this field is interesting for quite a few of us. The reason is that you get more performance, more effective bandwidth and lower power usage in return.

In neural networks high precision is not required, and in Big Data processing approximation computing is also very useful. Most important is that you actually consider the possibility of trading in precision when designing your OpenCL software. For example, does your window function need to be very precise, or can there be rounding errors? Do you compute in iterative steps (more precision needed), or relative to the starting point (less precision needed)? You can even use relatively more expensive algorithms that compensate with a smaller overall error. Here at StreamHPC we think this through as one of the main optimisation techniques.

Let’s look into what is possible in OpenCL and what the hardware support is.
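
As a first taste: OpenCL C offers native_ variants of the math built-ins, which map to the device’s fast special-function hardware at vendor-defined (lower) accuracy. A minimal sketch:

// Trading precision for speed: sin() guarantees at most 4 ulp error,
// native_sin() uses the special-function unit at vendor-defined accuracy.
__kernel void window(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = native_sin(in[i]) * native_recip(1.0f + in[i] * in[i]);
}
// Building with "-cl-fast-relaxed-math" relaxes precision kernel-wide.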

Handling OpenCL with CMake 3.1 and higher

There has been quite some “find OpenCL” code for CMake around. If you haven’t heard of CMake: it’s the most useful cross-platform tool for building cross-platform software.

Put this into CMakeLists.txt, changing the names for the executable.

#Minimal OpenCL CMakeLists.txt by StreamHPC

cmake_minimum_required (VERSION 3.1)

project(GreatProject)

# Handle OpenCL
find_package(OpenCL REQUIRED)
include_directories(${OpenCL_INCLUDE_DIRS})

add_executable (main main.cpp)
target_include_directories (main PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})
target_link_libraries (main ${OpenCL_LIBRARIES})

Then do the usual:

  • make a build-directory
  • cd build
  • cmake .. (specifying the right Generator)

Add your own CMake snippets and you’re one happy dev!

CMake 3.7

CMake 3.7 makes it even easier! You can do the following:

find_package(OpenCL REQUIRED)
add_executable(test_tgt main.c)
target_link_libraries(test_tgt OpenCL::OpenCL)

This automatically sets up the include paths and target library to link against. No need to use the ${OpenCL_INCLUDE_DIRS} and ${OpenCL_LIBRARIES} any more.

(Thanks Matthäus G. Chajdas for improving this!)
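
The find-module also reports the detected OpenCL version via OpenCL_VERSION_STRING, which you can check at configure time – a small sketch:

find_package(OpenCL REQUIRED)
message(STATUS "Found OpenCL ${OpenCL_VERSION_STRING}")
if(OpenCL_VERSION_STRING VERSION_LESS "1.2")
  message(FATAL_ERROR "OpenCL 1.2 or newer is required")
endif()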

Getting CMake 3.1 or higher

  • Ubuntu/Debian: Get the PPA.
  • Other Linux: Get the latest tar.gz and compile.
  • Windows/OSX: Download the latest exe/dmg from the CMake homepage.

If you have more tips to share, put them in the comments.

Call for papers: SYCL workshop, 13-March-2016, Barcelona, Spain

A high-level language has been on OpenCL’s roadmap for years, and would be started once the foundations were ready. Therefore, with OpenCL 2.0, SYCL was born.

To keep the pace high, a SYCL workshop is being organised. This week the call-for-papers opened, which you can read below.
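
If you haven’t seen SYCL code yet, here is a minimal sketch of the single-source style it enables, written against the provisional SYCL 1.2 interface – exact names may differ per implementation:

// Vector addition: host and device code in one C++ source.
#include <CL/sycl.hpp>
#include <vector>
#include <iostream>

int main() {
    const size_t N = 1024;
    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);
    {
        cl::sycl::queue q;  // selects a default OpenCL device
        cl::sycl::buffer<float, 1> A(a.data(), cl::sycl::range<1>(N));
        cl::sycl::buffer<float, 1> B(b.data(), cl::sycl::range<1>(N));
        cl::sycl::buffer<float, 1> C(c.data(), cl::sycl::range<1>(N));
        q.submit([&](cl::sycl::handler &cgh) {
            auto ka = A.get_access<cl::sycl::access::mode::read>(cgh);
            auto kb = B.get_access<cl::sycl::access::mode::read>(cgh);
            auto kc = C.get_access<cl::sycl::access::mode::write>(cgh);
            cgh.parallel_for<class vadd>(cl::sycl::range<1>(N),
                                         [=](cl::sycl::id<1> i) {
                kc[i] = ka[i] + kb[i];
            });
        });
    }  // buffers go out of scope: results are copied back into the vectors
    std::cout << "c[0] = " << c[0] << std::endl;
    return 0;
}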

1st SYCL workshop (SYCL’16) – co-located with PPoPP’16

Barcelona, Spain Sunday, 13th March, 2016

SYCL (sɪkəl – as in sickle) is a royalty-free, cross-platform C++ abstraction
layer that builds on the underlying concepts, portability and efficiency of
OpenCL, while adding the ease-of-use and flexibility of C++. For example, SYCL
enables single source development where C++ template functions can contain both
host and device code to construct complex algorithms that use OpenCL
acceleration, and then re-use them throughout their source code on different
types of data. SYCL has also been designed with resilience from the start, by
featuring, for example, a fall-back mechanism to automatically re-enqueue
kernels on different queues in case of a failure.

The SYCL Workshop aims to gather together SYCL’s users, researchers, educators
and implementors to encourage and grow a community of users behind the SYCL
standard, and related work in C++ for heterogeneous architectures. This will be
a half-day workshop. SYCL’16 will be held in Barcelona, 13 March 2016,
co-located with PPoPP 2016, HPCA 2016, CGO 2016 and LLVM 2016.

Travel Awards

Student authors who present papers in this workshop are eligible to apply for
travel awards. Further details will be announced after notification of
acceptance.

Important Dates

Submissions: 23rd November
Notification: 21st December
Final version: 24th January, 2016
Workshop: Sunday, 13th March, 2016

Submission Guidelines

All submissions must be made electronically through the conference submission
site, at https://easychair.org/conferences/?conf=sycl16.
Submissions may be one of the following:

  • Extended abstract: Two pages in standard SIGPLAN two-column conference
    format (preprint mode, with page numbers)
  • Short Paper: Four to six pages in standard SIGPLAN two-column conference
    format (preprint mode, with page numbers)

Submissions must be in PDF format and printable on US Letter and A4 sized
paper. All submissions will be peer-reviewed by at least two members of the
program committee. We will aim to give longer presentation slots to papers than
to extended abstracts. Conference papers will not be published, but made
available through the website, alongside the slides used for each presentation.
The aim is to enable authors to get feedback and ideas that can later go into
other publications. We will encourage questions and discussions during the
workshop, to create an open environment for the community to engage with.

Topics of interest include, but are not limited to:

  • Applications implemented using SYCL
  • C++ Libraries using SYCL
  • C++ programming models for OpenCL (C++AMP, Boost.Compute, …)
  • Other C++ applications using OpenCL
  • New proposals to the SYCL specification
  • Integration of SYCL with other programming models
  • Compilation techniques to optimise SYCL kernels
  • Performance comparisons between SYCL and other programming models
  • Implementation of SYCL on novel architectures (FPGA, DSP, …)
  • Using SYCL in fault-tolerant systems
  • Reports on SYCL implementations
  • Debuggers, profilers and tools

Organising Committee

Paul Keir, University of the West of Scotland (UK)
Ruyman Reyes, Codeplay Software Ltd, Edinburgh (UK)

Program Committee

Jens Breitbart, TU Munich
Alastair Donaldson, Imperial College London, UK
Christophe Dubach, University of Edinburgh, UK
Joel Falcou, LRI, Université Paris-Sud, France
Benedict Gaster, University of the West of England, UK
Vincent Hindriksen, StreamHPC, Netherlands
Christopher Jefferson, St. Andrews University, UK
Ronan Keryell, Xilinx, Ireland
Zoltán Porkoláb, ELTE, Hungary
Francisco de Sande, Universidad de La Laguna, Spain
Ana Lucia Varbanescu, University of Amsterdam, Netherlands
Josef Weidendorfer, TU Munich

Yes, we’re on the Program Committee as one of the few non-academics. We’re looking forward to reading your proposals!

If you have a blog, feel free to copy the above text and repost it.

OpenCL at SC15 – the booths to go to

This year we’re unfortunately not at SuperComputing 2015, for reasons you will hear later. But we haven’t forgotten about the people going and trying to find their share of OpenCL. Below is a list of companies with a booth at SC15, assembled by the guys of IWOCL and completed by us with some more background information.

Khronos

The first place to go to is booth #285 and meet Khronos to hear where to go at SC15 to see how OpenCL has risen over the years. More info here. Say hi from the StreamHPC team!

OpenCL on FPGAs

Altera | Booth: #462. Expected to have many demos on OpenCL. See their program here. They have brought several partners around the floor, all expecting to have OpenCL demos:

  • Reflex | Booth: #3115.
  • BittWare | Booth #3010.
  • Nallatech | Booth #1639.
  • Gidel | Booth #1937.

Xilinx | Booth: #381. Expected to show their latest advancements on OpenCL. See their program here.

Microsoft | Booth: #1319. Microsoft Bing is accelerated using Altera and OpenCL. Ask them for some great technical details.

ICHEC | Booth #2822. The Irish HPC centre works together with Xilinx using OpenCL.

Embedded OpenCL

ARM | Booth: #2015. Big on 64 bit processors with several partners on the floor. Interesting to ask them about the OpenCL-driver for the CPU and their latest MALI performance.

Huawei Enterprise | #173. Recently proudly showed the world their OpenCL capable camera-phones, using ARM MALI.

HPC OpenCL

Below are the three companies that promise at least 1 TFLOPS DP per co-processor.

Intel | Booth: #1333/1533. Where they used to speak about OpenMP and forget about OpenCL, Altera has brought them back. Maybe they’ll share some plans about Xeon+FPGA, or OpenCL support for the new XeonPhi.

AMD | Booth: #727. HBM, HSA, Green500, HPC APU, 32GB GPUs and 2.2 TFLOPS performance – enough to talk about with them. Also lots of OpenCL love.

NVidia | Booth: #1021. Every year they have been quite funny when asked why OpenCL is badly supported. Please do ask them this question again! Funniest answer wins something from us – to be decided.

Others

You’ll find OpenCL in many other places.

ArrayFire | Booth #2229. Their library has an OpenCL backend.

IBM | Booth: #522. Now that Altera has joined Intel, IBM’s OpenPOWER is left with NVidia for accelerators. OpenCL could revive the initiative.

NEC | Booth: #313. The NEC group has accelerated PostgreSQL with OpenCL.

Send your photos and news!

Help us complete this post with news and photos. We’re sorry not to be there this year, so we need your help to make the OpenCL party complete. You can send via email, twitter or the comments below. Thanks in advance!

Rant: No surprise there’s a shortage of good GPU-developers

Another Monday, yet another graphics API

We could read here that software is critical for HPC – a market where accelerators/GPUs are used a lot. So all we need to do is better support all GPU developers as a whole, no? Unfortunately something else is happening.

Each big corporation wants to have their own developers, not to be shared with the competition.

Microsoft was quite early in this, with Ballmer’s “developers, developers, developers” meme. A tip of the hat to them for acting on the shortage, a shake of the head for how they acted. For .NET it was a success to steal away developers from Java and C/C++, increasing the market share of Windows Server, SQL Server and more.

GPU vendors want that too – growing the cake together they find too slow – best is to start the fight while the cake is tiny.

Medical Technology

Our expertise in parallel image processing is ideally suited for meeting the large computing demands in modern medical imaging. High-resolution microscopy images can easily take several GB in size, and high-content screening of microscopy data sets using classical software tools is extremely time-consuming. Using GPU-based parallel computing solutions, we can dramatically cut down processing and waiting times. For example, we have helped the Memorial Sloan Kettering Cancer Center by improving a tool they use daily: where their analysis previously took one hour, it now takes just two minutes – a speed-up of 30x. Their productivity has gone up at virtually no extra cost, as waiting for results is significantly reduced without the need to buy new computers.

StreamHPC also has experience in high-performance implementations of molecular dynamics simulation software. Such software can assist proteomics research in the prediction of protein structure and aid computational drug design for more rapid drug discovery.

Computer Vision

Computing demands in computer vision are high, and often real-time processing with low latency is desirable. Computer vision can greatly benefit from parallelization, as higher processing speeds can improve object recognition rates, while FPGA solutions may reduce energy demands or support the perception of lag-free processing. At StreamHPC, we have supported several customers in optimizing their software to run on a lower power budget and at higher speed. We can support you with dedicated solutions based on GPUs or FPGAs to meet your demands.

Molecular Dynamics

Penicillin

StreamHPC has carried out several successful parallel-computing projects in molecular dynamics since 2012. Below are a few examples of our work in bioinformatics, chemistry and meteorology.

GROMACS does soft matter simulations on molecular scale

GROMACS is one of the fastest molecular dynamics software packages on the market. To broaden the user base that can benefit from the processing power of modern GPUs, we ported GROMACS from CUDA to OpenCL and further optimised the code for use with AMD FirePro accelerators. The resulting performance is on a par with that of the original CUDA code but without the restriction of being bound to a specific parallel computing hardware. GROMACS is used world-wide by over 5000 research centers, from simulating molecular docking to examining the hydrogen bonds in a falling water drop. Read more…

For Stanford University, we further optimised a part of TeraChem, a general-purpose quantum chemistry software package designed to run on NVIDIA GPU architectures. Our work added an extra 70% performance to the already-optimised CUDA code.

For the University of Manchester, we developed a high-performance implementation of the UNIFAC group contribution model for their research on atmospheric aerosol particles. Where an OpenMP implementation of the original single-threaded code got the run time down from 32 to about 10 seconds on a quad-core CPU, we eventually brought it down to 0.062 seconds using OpenCL on a Xeon Phi accelerator – a speedup of 160x over OpenMP. Read more…

Embedded

Embedded is an industry often combined with image processing, computer vision or machine learning. The goal is high-performance computing on batteries.

At StreamHPC we have often helped speed up algorithms, as faster software also means doing the same work with less power.

See the “low power” tab under the “technologies” menu for the hardware architectures we master.

Heterogeneous Systems Architecture – memory sharing and task dispatching

Want to get an overview of what Heterogeneous Systems Architecture (HSA) does, or want to know what terminology has changed since version 1.0? Read on.

Back in 2012 the goals for HSA were set high. The group tried to design a system where CPU and GPU work together in an efficient way. In the 2013/2014 time-frame you’ll find lots of articles around the web, including on our blog, describing the capabilities of HSA. Unfortunately, with the 1.0 specification most terminology has changed.

In March 2015 the HSA Foundation released the final 1.0 specification. It no longer discusses hUMA (Heterogeneous Uniform Memory Access) or hQ (Heterogeneous Queuing). These two techniques had undergone so many updates that new terminology was introduced.

In this blog post, we’ll present you an updated description of the two most important problems tackled by HSA: memory sharing and task dispatching.

We’ll keep tuning the description below, so feedback is always welcome – the focus is on clarity, not on completeness.

What is an HSA System?

Where the original HSA goals focused more on SoCs with CPU and GPU cores, now any compute core can be used. The reason is that modern SoCs are much more complex than just a CPU and GPU – integrated DSPs and video decoders are found on many processors. HSA thus now (officially) supports truly heterogeneous architectures.

The idea is that any heterogeneous processor can be designed according to the principles of HSA. This will bring down design costs and enable more exotic configurations from different vendors.

An interesting fact about the HSA specifications is that they only specify goals, not how they must be implemented. This makes it possible to implement the specifications in software instead of hardware, making it possible to upgrade older hardware to HSA.

Why is HSA important?

A simple question: “will there be more CPUs with embedded GPUs, or discrete GPUs?”. A simple answer: “there are already more integrated GPUs than discrete ones”. HSA defines those chips with mixed processors.

CPUs with embedded GPUs used to be not much more than the discrete GPUs with shared memory we know from cheap laptops of the 00s. When the GPU got integrated, each vendor started to create solutions for inter-processor dispatching (threading extended to heterogeneous computing), coarse-grained sharing (transferring ownership between processor units) and fine-grained sharing (atomics working across all processor units).

The HSA Foundation

Sometimes an industry makes bigger steps by competing and sometimes by collaborating

AMD recognised the need for a standard, as it wanted to avoid the problems seen when 64-bit was introduced into x86, and therefore initiated the HSA Foundation. The founding members are AMD, ARM, Imagination Technologies, MediaTek, Qualcomm, Samsung and Texas Instruments. NVidia and Intel are awkwardly absent.

Memory Sharing

HSA uses a relaxed memory model, which has full memory coherence (data guaranteed to be the same for all processes on all cores) and is pageable (subsets can be reserved by programs).

The write-up below is heavily simplified, to give an overview of how memory sharing is designed under HSA. If you want to know more, read chapter 5 of the HSA book.

Inter-processor memory-pointer sharing – Unified Addressing

The most important part is the unified memory model (previously referred to as “hUMA”), which makes programming the memory interactions in a heterogeneous processor with CPU cores, GPU cores and DSP cores comparable to a multi-core CPU.

Like other modern memory models, HSA defines various segments, including global, shared and private. A difference is that flat addressing is used. This means that each address pointer is unique: you don’t have an address 0 for private and an address 0 for global. Flat addressing simplifies optimisation operations for higher-level languages. Of course you still need to be aware that each segment’s size is limited, and there will be consequences when defining larger memory chunks than are available in the segment.

When you have created a memory object and want the DSP or GPU to continue working on it, you can use the same pointers without any translation.

Inter-processor cache coherency

In HSA-systems global memory is coherent without the need for explicit cache maintenance. This means that local caches are synchronised and/or that caches are shared. For more information, read this blog from ARM.

Fine grained memory – Atomic Operations

HSA allows memory segments to be protected for atomic access. This makes it possible to have multiple threads running on different cores of different processor units, all accessing the same memory in a safe manner.

Small and large consecutive memory segments can be reserved for sharing, from very fine to coarse grained. All threads that have access to that segment are notified when atomic operations are done.

Fine Grained Shared Virtual Memory (HSA compatibility for discrete GPUs)

AMD has made some efforts to extend HSA to discrete GPUs. We’ll see the real advantages with dispatching, but it also helps to create cleaner memory management.

The so-called “Fine-Grained Shared Virtual Memory” makes it possible to use HSA with discrete GPUs that have HSA support. Because it’s virtual and data is continuously transferred between the GPU and the HSA processor, the performance is of course lower than with real shared memory. You can compare it to NVidia’s Unified Virtual Memory, and it has also long been planned for OpenCL 2.0.
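
In OpenCL 2.0 terms, fine-grained shared virtual memory looks like the sketch below – assuming a context and kernel created on a device that reports CL_DEVICE_SVM_FINE_GRAIN_BUFFER support:

// Host and device share the very same pointer - no explicit transfers.
#include <CL/cl.h>

void share_pointer(cl_context ctx, cl_kernel kernel)
{
    const size_t N = 1024;
    float *data = (float *)clSVMAlloc(ctx,
            CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
            N * sizeof(float), 0);
    data[0] = 1.0f;                             // host writes through the pointer
    clSetKernelArgSVMPointer(kernel, 0, data);  // device sees the same address
    // ... enqueue the kernel; no clEnqueueWriteBuffer or map/unmap needed ...
    clSVMFree(ctx, data);
}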

Dispatching

HSA defines in detail how a task gets into the queue of a worker thread. Below is an overview of how queues, threads and tasks are defined and named under HSA.

Queueing

Before HSA 1.0 we only spoke of the “Heterogeneous Queue” (hQ). This has now been developed further into “User Mode Queues”. A User Mode Queue holds the list of tasks for a specific (group of) processor cores, resides in shared memory and is allocated at runtime.

Such a task is described in a language called the “Architected Queueing Language” (AQL), and is called an “AQL packet”.

Agents and Kernel Agents

HSA threads run on one or a group of processor cores. These threads are called “Agents” and come in two variations: normal Agents and Kernel Agents. A Kernel Agent is an Agent that has a User Mode Queue and can execute kernels that work on a segment of memory. A normal Agent doesn’t have a queue and can only execute simple tasks.

If a normal agent cannot run kernels, but can run tasks, then what can it actually do? Here are a few examples:

  • Allocate memory, or other tasks only the host can do.
  • Send back (intermediate) data to the host – for example progress indication.

If you compare it to OpenCL: an agent is like the host (which creates the work) and kernel agents are like the kernels (which can issue new threads under OpenCL 2.0).

AQL packages: communicating dispatch tasks

There are different types of AQL (Architected Queueing Language) packets, of which these are the most important:

  • Agent dispatch packet: contains jobs for normal agents.
  • Kernel dispatch packet: contains jobs for kernel agents.
  • Vendor-specific packet: between processors of the same vendor there can be more freedoms.

In most cases we’ll be talking about kernel dispatch packets.

The Doorbell signal: low latency dispatching

HSA dispatching is extremely fast and power-efficient due to the implementation of a “doorbell”. The doorbell of an agent is signalled when a new task is available, making it possible to take immediate action. A problem in OpenCL is the high dispatch time for GPUs without a doorbell – up to the millisecond range, as we have measured. For HSA-enabled GPUs the response time before a kernel starts running is in the microsecond range.
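
For the curious, below is a hedged sketch of what this looks like with the HSA runtime API: filling an AQL kernel dispatch packet in a User Mode Queue and ringing the doorbell. Agent discovery, kernel loading and error handling are omitted, and the grid sizes are arbitrary example values:

#include <hsa.h>
#include <cstdint>
#include <cstring>

void dispatch(hsa_agent_t kernel_agent, uint64_t kernel_object,
              void *kernargs, hsa_signal_t completion)
{
    hsa_queue_t *queue = nullptr;
    hsa_queue_create(kernel_agent, 4096, HSA_QUEUE_TYPE_SINGLE,
                     nullptr, nullptr, UINT32_MAX, UINT32_MAX, &queue);

    // Reserve a slot in the User Mode Queue (which lives in shared memory).
    uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
    hsa_kernel_dispatch_packet_t *packet =
        (hsa_kernel_dispatch_packet_t *)queue->base_address
        + (index % queue->size);

    // Fill the AQL packet: a 1D grid of 1M work-items in groups of 256.
    std::memset(packet, 0, sizeof(*packet));
    packet->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
    packet->workgroup_size_x = 256;
    packet->workgroup_size_y = packet->workgroup_size_z = 1;
    packet->grid_size_x = 1024 * 1024;
    packet->grid_size_y = packet->grid_size_z = 1;
    packet->kernel_object = kernel_object;
    packet->kernarg_address = kernargs;
    packet->completion_signal = completion;

    // Publish the packet type last, then ring the doorbell: the kernel
    // agent can start within microseconds, without an OS round-trip.
    __atomic_store_n(&packet->header,
                     (uint16_t)HSA_PACKET_TYPE_KERNEL_DISPATCH,
                     __ATOMIC_RELEASE);
    hsa_signal_store_relaxed(queue->doorbell_signal, index);
}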

Context switching

Threads can move from one core to another core – the task is removed from the current queue and added to another queue. This can even happen while the thread is in the running state.

StreamHPC’s position

The solution simply works and produces faster code – we did a large project with it last year.

It seems that almost the whole embedded-processor industry believes in it. AMD (CPU+GPU), ARM (CPU+GPU), Imagination (GPU), MediaTek, Qualcomm (GPU), Samsung and Texas Instruments (DSP) are founders. Companies like Analog Devices, CEVA, Sony, VIA, S3, Marvell and Cadence have joined the club later. Important Linux clubs like Linaro and Canonical are also present.

The system-on-a-chip will only get more traction, and we see HSA as an enabler. Languages like OpenCL and OpenMP can be compiled down to HSA, so it just takes switching the compiler. HSA-capable software can be written in a more efficient manner, as it can now be assumed that memory can be shared efficiently and that dispatching new threads is really fast.

OpenCL.org

Last year we bought OpenCL.org with the purpose of supporting the OpenCL community and OpenCL-focused companies. In January we launched the first community project on the website: porting GEGL to OpenCL. See below for more info.

The knowledge section of our homepage will be moved to the OpenCL.org website, but will still be maintained by us.

GEGL project

GEGL is a free/libre graph based image processing framework used by GIMP, GNOME Photos, and other free software projects.

In January 2016 we launched an educational initiative that aims to get more developers to study and use OpenCL in their projects. Within this project, up to 20 collaborators will port as many GEGL operations to OpenCL as possible.

The goal of this project is to find a way for a group to educate themselves in OpenCL while supporting an open-source project. One way is to gamify the porting by benchmarking the kernels and declaring winners; another is to optimise kernels within StreamHPC to push the limits. Victor Oliveira, who wrote most of the OpenCL code in GEGL, joined the GEGL-OpenCL project as an advisor.

All work is being done on GitHub. The communication between participants is taking place in a dedicated Slack channel (invite-only).

Want to have a vote on what is the next porting project after GEGL? Vote here.