Theoretical transfer speeds visualised

There are two overviews I use during my trainings that I would like to share with you. Normally I draw them on a whiteboard, but having them in digital form has its advantages.

Transfer speeds per bus

The image below gives an idea of theoretical transfer speeds, so you know how a fast network (1 GB of data in 10 seconds) compares to GPU memory (1 GB of data in 0.01 seconds). It does not show all the ins and outs, but just gives an idea of how things compare. For instance, it does not show that many cores on a GPU need to work together to reach that maximum transfer rate. Also, I have not used very precise benchmark methods to arrive at these views.

We zoom in on the slower bus speeds, so all the good stuff is on the left and all the buses to avoid are on the right. What should be clear is that a read from or write to an SSD will make the software very slow if you use write-through instead of write-back.

What is important to see is that data locality makes a big difference. Take a look at the image and then try to follow along with me. When using GPUs, the following can all increase the speed on the same hardware: keeping hard disks out of the computation queue, avoiding transfers to and from the GPU, and increasing the number of computations per byte of data. When an algorithm needs to do a lot of data operations, such as transposing a matrix, it’s better to have a GPU with high memory bandwidth. When the number of operations dominates, clock speed and cache speed matter most.

Continue reading “Theoretical transfer speeds visualised”

Basic concepts: Function Qualifiers

Optimisation of one’s thoughts is a complex problem: a lot of interacting processes can be defined, if you think about it.

In OpenCL you have to distinguish between run-time and compile-time of the C code. It is very important to make this clear when you talk about compile-time of the kernel, as this can be confusing: the kernel is compiled at run-time of the software, after the compute devices have been queried. The OpenCL compiler can generate better-optimised code when you give it as much information as possible. One of the methods for doing so is using function qualifiers. A function qualifier is written as a kernel attribute:

__kernel __attribute__((qualifier(qualification))) void foo ( ... ) { ... }

There are three qualifiers described in OpenCL 1.x. Let’s walk through them one by one. You can also read about them here in the official documentation, which has more examples.
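
As a sketch of how the three qualifiers look in kernel code (the kernels themselves are made-up examples; only the attribute syntax comes from the specification):

```c
// Hint that the kernel mostly computes on float4, which helps the
// compiler choose a vectorisation strategy.
__kernel __attribute__((vec_type_hint(float4)))
void scale(__global float4* data) {
    data[get_global_id(0)] *= 2.0f;
}

// Hint that the work-group size will usually be 64x1x1.
__kernel __attribute__((work_group_size_hint(64, 1, 1)))
void copy_hint(__global const float* in, __global float* out) {
    size_t i = get_global_id(0);
    out[i] = in[i];
}

// Require the work-group size to be exactly 64x1x1; enqueueing with any
// other local size will fail, but the compiler can optimise aggressively.
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void reduce_step(__global float* data) {
    // ... reduction code ...
}
```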

Continue reading “Basic concepts: Function Qualifiers”

Engineering GPGPU into existing software

At the Thalesian talk about OpenCL I gave in London, it was quite hard to find a way to talk about OpenCL for a very diverse audience (without falling back to listing code samples for 50 minutes); some knew just about everything about HPC and others had only heard of CUDA and/or OpenCL. One of the subjects I chose to talk about was how to integrate OpenCL (or GPGPU in general) into existing software. The reason is that we have all built nice, cute little programs which were super-fast, but it’s another story when they must be integrated into enterprise-level software.

Readiness

The most important step is making your software ready. Software engineering can be very hectic; managing it in a nice manner (i.e. PRINCE2) just doesn’t fit into a deadline-driven schedule. We all know it costs less time and money when looking at the total picture, but time is just against us.

Let’s exaggerate. New ideas, new updates of algorithms, new tactics and methods arrive at the wrong moment, Murphy-wise. It has to be done yesterday, so testing is only allowed if the code goes into the production code too. Programmers just have to understand the cost of delay, but luckily someone comes to the rescue and says: “It is my responsibility”. And after a year of stress your software is the best in the company and gets labelled as a “platform”, meaning that your software is chosen to include all the small ideas and scripts your colleagues have come up with “which are almost the same as what your software does, only a little different”. This will turn the platform into something unmanageable. That is a different kind of software acceptance!

Continue reading “Engineering GPGPU into existing software”

To think about: an OpenCL Wizard

Part of a reaction to my earlier post was: “VB Programmers span the whole range from hobbyists to scientists and professions. There is no reason that VB programmers should be left out of the loop in GPU programming. (…) CUDA and OpenCL are fundamentally lower level languages that do not have the same level of user-friendlyness that VB gives, and so they require a bit more care to program. (…)“. By selecting parts, I probably put the emphasis very wrongly, but it gave me an idea: what would a Visual Basic wizard for OpenCL look like?

It is actually very interesting to think about how to bring GPGPU power to a programmer who is used to “and then and then and then”; how do you get parallel processing into the mind of an iterative thinker? Simplification is not easy!

By thinking about an OpenCL-wizard, you start to understand the bottlenecks of, and need for OpenCL initialisation.

Actually, it would be better to build it into an existing framework like OpenMP (or something alike in Python, Java, etc.), or the IDE could give hints that a standard function could be replaced by a GPGPU version. But we just want a wizard which generates “My First OpenCL Program” and puts a smile on the face of programmers who use their mouse a lot.

Continue reading “To think about: an OpenCL Wizard”

OpenCL

StreamHPC is best known for its OpenCL services, including development. We have achieved record speed-ups of over 250,000 times when optimising code using OpenCL. During those projects several techniques have been used to reach such numbers – even well-designed projects we can often still speed up 2 to 8 times.

Advantages

OpenCL works on most types of hardware from any vendor. Compare it to C and C++, which are used to program all kinds of software on very different hardware.

Basic OpenCL is used to write portable software that performs well. Advanced OpenCL is used to optimise for maximum performance on specific accelerators.

Hardware

Projects can be targeting one or more operating systems, focusing on one or several processors:

  • CPUs, by Intel, AMD and/or ARM
  • NVidia GPUs
  • AMD GPUs
  • Embedded GPUs
    • Vivante
    • ARM MALI
    • Imagination
    • Qualcomm
  • Altera FPGAs
  • Xilinx FPGAs
  • Several special focus processors, mostly implementing a subset of OpenCL.

We use modern coding techniques to write code with maximum flexibility and maximum performance.

How OpenCL works

OpenCL is an extension to existing languages. It makes it possible to specify a piece of code that is executed multiple times, independently of each other. This code can run on various processors – not only the main one. There is also an extension for vectors (float2, short4, int8, long16, etc.), because modern processors support those.

So, for example, say you need to calculate sin(x) for a large array of one million numbers. OpenCL detects which devices could compute this for you and gives some statistics on each device. You can pick the best device, or even several devices, and send the data to the device(s). Normally you would loop over the million numbers, but now you say something like: “Give me sin(x) of each x in array A”. When finished, you take the data back from the device(s) and you are done.
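
The kernel side of such a sin(x) computation could be sketched like this – the ND-range replaces the loop, so each work-item processes one element (illustrative code; the names are ours):

```c
// OpenCL kernel: runs once per element; "get me sin(x) of each x in A".
__kernel void sin_all(__global const float* a, __global float* result) {
    size_t i = get_global_id(0);  // which element this work-item handles
    result[i] = sin(a[i]);
}
```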

As compute devices can do more in parallel and OpenCL is better at describing independent functions, the total execution time is much lower than with conventional methods.

4 common questions on OpenCL

Q: Why is it so fast?
A: Because many hands make light work – the hundreds of little processors on a graphics card being the extra hands. But cooperation with the main processor remains important to achieve maximum output.

Q: Does it work on any type of hardware?
A: As it is an open standard, it can work on any type of hardware that targets parallel execution. This can be a CPU, GPU, DSP or FPGA.

Q: How does it compare to OpenMP/MPI?
A: Where OpenMP and MPI try to split loops over threads/servers and are CPU-oriented, OpenCL focuses on making threads data-position aware and on exploiting processor capabilities. There are several efforts to combine the two worlds.

Q: Does it replace C or C++?
A: No, it is an extension which integrates well with C, C++, Python, Java and more.

SC14 Workshop Proposals due 7 February 2014

Just to let you know that there should be even more OpenCL and related technologies on SC14


Are you interested in hosting a workshop at SC14?

Please mark your calendars as SC will be accepting proposals from 1 January – 7 February for independently planned full-, half-, or multi-day workshops.

Workshops complement the overall SC technical program. The goal is to expand the knowledge base of practitioners and researchers in a particular subject area, providing a focused, in-depth venue for presentations, discussion and interaction. Workshop proposals will be peer-reviewed academically, with a focus on submissions that will inspire deep and interactive dialogue on topics of interest to the HPC community.

For more information, please consult: http://sc14.supercomputing.org/program/workshops

Important Submission Info

Web Submissions Open: 1 January 2014
Submission Deadline: 7 February 2014
Notification of acceptance: 28 March 2014

Submit proposals via: https://submissions.supercomputing.org/
Questions: workshops@info.supercomputing.org

We’re thinking of proposing one too. Want to collaborate? Mail us! Don’t forget to go to HiPEAC (20 January) and IWOCL (12 May) to meet us!

The Exascale rat-race vs getting-things-done with HPC

IDC Forecasts 7 Percent Annual Growth for Global HPC Market – HPCwire

When the new supercomputer “Cartesius” of the Netherlands was presented to the public a few months ago, the buzz was not around FLOPS, but around users. SARA CEO Dr. Ir. Anwar Osseyran kept focusing on this aspect. The design of the machine was not driven by getting into the TOP500, but by improving the performance of the current users’ software. This was applauded by various HPC experts, including StreamHPC. We need to get things done, not win a virtual race over some number.

In the description of the supercomputer, the TOP500 position was only mentioned at the bottom of the page:

Cartesius entered the Top500 in November 2013 at position 184. This Top500 entry only involves the thin nodes resulting in a Linpack performance (Rmax) of 222.7 Tflop/s. Please note that Cartesius is designed to be a well balanced system instead of being a Top500 killer. This is to ensure maximum usability for the Dutch scientific community.

What would happen if you went for a TOP500 supercomputer? You might get a high energy bill and an overpriced, inefficient supercomputer. The first months you will not have full usage of the machine, and you won’t be able to easily turn off some parts, hence the waste of electricity. The end result is that it becomes more attractive to run unoptimised code on the cluster than to take time for coding.

The inefficiency is due to the fact that some software is data-transfer limited and other software is compute limited. No need to explain that if you go for a TOP500 position instead of a software-optimised design, you end up buying extra hardware to get all kinds of algorithms performing. Cartesius therefore has “fat nodes” and “light nodes” to get the best bang per buck.

There is also a plan for expanding the machine over the years (on-demand growth), so that the users remain happy instead of getting one adrenaline shot at once.

The rat-race

The HPC TOP500 is run by the company behind the ISC events. They care about their list being used, not whether Exascale arrives now or later. There are two companies with a particular interest in Exascale: Intel and IBM. It hardly matters anymore how it began. What is interesting is that Intel has bought an Infiniband business and is collecting companies that could make it the one-stop shop for an HPC cluster. IBM has always been strong in supercomputers with its BlueGene HPC line. Intel has a very nice infographic on Intel + Exascale, which shows how serious they are.

But then the big question: did all this pushing speed up the road to Exascale? Well, no… just the normal peaks and lows around the theoretical logarithmic line:

source: CNET

What I find interesting in this graph is that the #500 line is diverging from the #1 line. With GPGPU it was quite easy to enter the TOP500 three years ago.

Did the profits rise? Yes. While PC-sales went down, HPC-revenues grew:

Revenues in the high-performance computing (HPC) server space jumped 7.7 percent last year to $11.1 billion surpassing the $10.3 billion in revenues generated in 2011, according to numbers released by IDC March 21. This came despite a 6.8 percent drop in shipments, an indication of the growing average selling prices in the space, the analysts said. (eWeek.)

So, mainly the willingness to buy HPC has increased. And you cannot stay behind when the rest of the world is focusing on Exascale, can you?

Read more

Keep your feet on the ground and focus on what matters: papers and answers to hard questions.

Did you solve a compute problem and get published using a sub-top-250 supercomputer? Share it in the comments!

AMD updates the FirePro S10000 to 12GB and passive cooling

Let the competition on large-memory GPUs begin!

Some algorithms and continuous batch processes will enjoy the extra memory. For example, when inverting a large matrix or running huge simulations, you need as much memory as possible. It also helps to avoid memory-bank conflicts by duplicating data objects (possible only when the data stays in memory long enough to pay for the time it costs to duplicate it).

Another reason for larger memories is double-precision computation (this card delivers a total of 1.48 TFLOPS), which doubles the memory requirements. With accelerators becoming a better fit for HPC (true support for the IEEE-754 double-precision storage format, ECC memory), memory size becomes one of the limits that needs to be solved.

The alternatives are swapping on GPUs or using multi-core CPUs. Swapping is not an option, as it nullifies all the speed-up. A server with 4 × 16-core CPUs is as expensive as one accelerator, but uses more energy.

AMD seems to have identified this as an important HPC market and has therefore just announced the new S10000 with 12GB of memory, to be sampled to AMD partners in January and on the market in April. Is AMD finally taking the professional HPC market seriously? They now have the first 12GB GPU accelerator built for servers.

Old vs New

Still a few question marks, unfortunately


| Functionality | FirePro S10000 6GB | FirePro S10000 12GB |
|---|---|---|
| GPU-processor count | 2 | 2 |
| Architecture | Graphics Core Next | Graphics Core Next |
| Memory per GPU-processor | 3 GB GDDR5 ECC | 6 GB GDDR5 ECC |
| Memory bandwidth per GPU-processor | 240 GB/s | 240 GB/s |
| Performance, single precision (per GPU-proc.) | 2.95 TFLOPS | 2.95 TFLOPS |
| Performance, double precision (per GPU-proc.) | 0.74 TFLOPS | 0.74 TFLOPS |
| Max power usage (whole dual-GPU card) | 325 W | 325 W (?) |
| Greenness, SP (whole dual-GPU card) | 20.35 GFLOPS/Watt | 18.15 GFLOPS/Watt |
| Bus interface | PCIe 3.0 x16 | PCIe 3.0 x16 |
| Price (whole dual-GPU card) | $3500 | ? |
| Price per GFLOPS (SP) | $0.60 | ? |
| Price per GFLOPS (DP) | $2.43 | ? |
| Cooling | Active (!) | Passive |

The biggest differences are the doubling of memory and the passive cooling.

Competitors

Its biggest competitor is the Quadro K6000, which I haven’t discussed at all. That card delivers 5.2 TFLOPS with a single GPU, which is able to access all 12GB of memory via a 384-bit bus at 288 GB/s (when all cores are used). It is actively cooled, so it’s not really fit for servers (like the 6GB version of the S10000). The S10000 has a higher total bandwidth, but each GPU-processor can only access its own half of the 12GB at full speed. So the K6000 has the advantage here.

Intel is planning 12GB and 16GB Xeon Phis. I’m curious to see more benchmarks of the new cards, as the 5110P does not have very good results (benchmark 1, benchmark 2). It compares more to a high-end Xeon CPU than to a GPU. I am more enthusiastic about the OpenCL performance on Intel’s CPUs.

What’s next on this path?

A few questions I asked myself and tried to find answers to.

Extendable memory, like we have for CPUs? Probably not, as GDDR5 is not designed to be upgradeable.

Unified memory for multi-GPUs? This would solve the disadvantage of multi-die GPU-cards, as 2, 4 or more GPUs could share the same memory. A reason to watch HSA hUMA‘s progress, which now specifies unified memory access between GPU and CPU.

24GB of memory or more? I’ve included the graph below to give an idea of the costs of GDDR memory, so it’s an option. These prices of course exclude supplementary parts and the R&D costs of making more memory accessible to the GPU cores.

GPU-parts pricing table
GPU-parts pricing table – Q3 2011

At least we will now get an answer to the question: is the market that needs this amount of memory large enough, and thus worth serving?

Is there more need for a wider memory bus? Remember that GDDR6 is promised for 2014.

What do you think of a 12GB GPU? Do you think this is the path that distinguishes professional GPUs from desktop-GPUs?

Promotion for OpenCL Training (’12 Q4 – ’13 Q2)

So you want your software to be much faster than the competition?

In 4 days, your software team learns all the techniques needed to make extremely fast software.

Your team will learn how to write optimal code for GPUs and make better use of the existing hardware. They will be able to write faster code immediately after the training – doubling the speed is the minimum, 100 times is possible. Your customers will notice the difference in speed.

We use advanced, popular techniques like OpenCL and older techniques like cache-flow optimisation. At the end of the training you’ll receive a certificate from StreamHPC.

Want more information? Contact us.

About the training

Location and Time

OpenCL is a rather new subject, and fixing the location and time in advance has not proved successful for trainers in this subject in past years. Therefore we chose flexible dates and initially offer the training in large/capital cities and technology centres worldwide.

A final date for a city will be picked once there are 5 to 8 attendees, with a maximum of 12. You can specify your preferences for cities and dates in the form below.

Some discounts are available for developing countries.

Agenda

Day 1: Introduction

Learn about GPU architectures and AVX/SSE, how to program them and why they are faster.

  • Introduction to parallel programming and GPU-programming
  • An overview of parallel architectures
  • The OpenCL model: host-programming and kernel-programming
  • Comparison with NVIDIA’s CUDA and Intel’s Array Building Blocks.
  • Data-parallel and task-parallel programming
Lab-session will be an image-filter.
Note: since CUDA is very similar to OpenCL, you are free to choose to do the lab-sessions in CUDA.

Day 2: Tools and advanced subjects

Learn about parallel-programming tactics, host-programming (transferring data), IDEs and tools.

  • Static kernel analysis
  • Profiling
  • Debugging
  • Data handling and preparation
  • Theoretical backgrounds for faster code
  • Cache flow optimisation
Lab-session: yesterday’s image-filters using a video-stream from a web-cam or file.

Day 3: Optimisation of memory and group-sizes

Learn the concept of “data-transport is expensive, computations are cheap”.
  • Register usage
  • Data-rearrangement
  • Local and private memory
  • Image/texture memory
  • Bank-conflicts
  • Coalescence
  • Prefetching
Lab-session: various small puzzles, which can be solved using the explained techniques.

Day 4: Optimisation of algorithms

Learn techniques to help the compiler make better and faster code.
  • Precision tinkering
  • Vectorisation
  • Manual loop-unrolling
  • Unbranching
Lab-session: like day 3, but now with compute-oriented problems.

Enrolment

When filling in this form, you declare that you intend to follow the course. Cancellation can be done via e-mail or phone at any time.

StreamHPC will keep you up-to-date about the training at your location(s). When the minimum of 5 attendees has been reached, a final date will be discussed. If you selected several locations, you have the option of waiting for a training in another city.

Put any remarks you have in the message. If you have any question, mail to trainings@streamhpc.com.

[si-contact-form form=’7′]

Faster Development Cycles for FPGAs

normal-vs-opencl-fpga-flow
The time difference between the normal and the OpenCL flow is large. The final product is just as fast and efficient.

VHDL and Verilog are not the right tools when it comes to developing on FPGAs fast.

  • It is time-consuming. If the first cycle takes 3 months, then each subsequent cycle easily takes 2 weeks. Time is money.
  • Porting or upgrading a design from one FPGA device to another is also time-consuming. This makes it essential to choose the final FPGA vendor and family upfront.
  • Dual-platform development on CPU and FPGA needs synchronisation. The code works on either the CPU or the FPGA, which makes the functional tests written for the CPU version less trustworthy.

Here is where OpenCL comes in.

  • Shorter development cycles. Programming in OpenCL is normally much faster than in VHDL or Verilog. If you are porting C/C++ code onto an FPGA, the development cycles will be dramatically shorter. Think weeks instead of months – as this news article explains. This means a radically reduced investment, as well as time for architectural exploration.
  • OpenCL works on both CPUs and FPGAs, so functional tests can be run on either. As a bonus the code can be optimised for GPUs, within a short time-frame.
  • The performance is equal to VHDL and Verilog, unless FPGA-specific optimisations are used, such as vector widths that are not a power of two.
  • Vendor Agnostic solution. Porting to other FPGAs takes considerably less time and the compiler solves this problem for you.
  • Both Xilinx and Altera have OpenCL compilers. Altera was the first to come out with an OpenCL offering and has a full SDK, which is an add-on to Quartus II. Xilinx also has a stand-alone OpenCL development environment called SDAccel.

Support for OpenCL is strong from both Altera and Xilinx

Both vendors suggest OpenCL to overcome existing FPGA design problems. Altera suggests using OpenCL to speed up the process for existing developers. So OpenCL is not a third-party tool that you need to trust separately.

OpenCL allows a user to abstract away the traditional hardware FPGA development flow for a much faster and higher level software development flow – Altera

Xilinx suggests that OpenCL can enable companies without the needed developer resources to start working with FPGAs.

Teams with limited or no FPGA hardware resources, however, have found the transition to FPGAs challenging due to the RTL (VHDL or Verilog) development expertise needed to take full advantage of these devices. OpenCL eases this programming burden – Xilinx

Why choose StreamHPC?

There are several reasons to let us do the porting and prototyping of your product.

  • We have the right background, as our team consists of CPU, GPU and FPGA developers. Our code is therefore designed with easy porting in mind.
  • Our costs are lower than having the product done in Verilog/VHDL.
  • We give guarantees and support for our products on all platforms the product is ported on.
  • We can port the final OpenCL code to Verilog/VHDL, keeping the same performance. In case you don’t trust a high-level language, we have you covered.
  • Optionally you can get both the code and a technical report with a detailed explanation of how we did it. So you can learn from this and modify the code yourself.
  • You get free advice on when (and not) to use OpenCL for FPGAs.

There are three ways to get in contact quickly:

call: +31 854865760 (European office hours)

e-mail: contact@streamhpc.com

Fill in this form – mention when you want to be called back (possible outside normal office hours):

[contact_form]

Want to read more?

We wrote about OpenCL on FPGAs on our blog in previous years.

Master+PhD students, applications for two PRACE summer activities open now

PRACE is organising two summer activities for Master+PhD students. Both activities are expense-paid programmes and will allow participants to travel and stay at a hosting location and learn about HPC:

  • The 2017 International Summer School on HPC Challenges in Computational Sciences
  • The PRACE Summer of HPC 2017 programme

The main objective of this programme is to enable HiPEAC member companies in Europe to have access to highly skilled and exceptionally motivated research talent. In turn, it offers PhD students from Europe a unique opportunity to experience the industrial research environment and to work on R&D projects solving real problems.

Below explains both programmes in detail. Continue reading “Master+PhD students, applications for two PRACE summer activities open now”

Academic hackathons for Nvidia GPUs

Are you working with Nvidia GPUs in your research and wish Nvidia would support you like they used to 5 years ago? This is now done via hackathons, where you get one full week of support to get your GPU code improved and your CPU code ported. You still have to do the work yourself, so it’s not comparable to the services we provide.

To start, get your team to commit to doing this. It takes preparation and a clear formulation of what your goals are.

When and where?

It’s already April, so some hackathons have already taken place. For 2019, the ones below are left, where you can work in any language, from OpenMP to OpenCL and from OpenACC to CUDA. Python + CUDA libraries is also no problem, as long as the focus is Nvidia.

Continue reading “Academic hackathons for Nvidia GPUs”

Learn about AMD’s PRNG library we developed: rocRAND – includes benchmarks

While CUDA kept its dominance over OpenCL, AMD introduced HIP – a programming language that closely resembles CUDA. Now it doesn’t take months to port code to AMD hardware: more and more CUDA software converts to HIP without problems. Even really large and complex code-bases take a few weeks at most, and we found that the problems solved along the way often made the CUDA code run faster too.

The only problem is that every CUDA library needs a HIP equivalent before all CUDA software can be ported.

This is where we come in. We helped AMD build a high-performance pseudo-random number generator (PRNG) library, called rocRAND. Random number generation is important in many fields, from finance (Monte Carlo simulations) to cryptography, and from procedural generation in games to providing white noise. For some applications any data will do, but for large simulations the PRNG is the limiting factor. Continue reading “Learn about AMD’s PRNG library we developed: rocRAND – includes benchmarks”

Making the release version of prototype code

We transform your prototype into a performant product

Prototypes are very useful to prove that concepts work and to test out variations in a short time. Often Matlab, Python, Excel or LabVIEW code is used, or libraries like OpenCV. The main problems with such code are often performance and portability. We solve that. Our software development projects are very much like the software projects you are used to; the main difference is that the code is much faster than if somebody else had written it.

An overview of our advantages:

  • As we only have senior programmers, the code quality is higher.
  • We are very much in control and can therefore have more variety:
    • We make our software work seamlessly together with your existing code, so it doesn’t need a rewrite.
    • Our code can work on various hardware, like CPU, GPU and FPGA.
    • Our code works on Windows, Linux and OSX.
  • You will get the fastest software on the market. We have often delivered speed-ups above expectation.

 

OpenCL in the cloud – API beta launching in a month

We’re starting the beta phase of our AMD FirePro-based OpenCL cloud services in about a month, to test our API. If you need your OpenCL-based service online and don’t want to pay hundreds to thousands of euros for GPU hosting, then this is what you need. We have room for a few more participants.

The instances are chrooted, not virtualised. The API-calls are protected and potentially some extra calls have to be made to fully lock the GPU to your service. The connection is 100MBit duplex.

Payment is per usage, per second, per GPU and per MB of data – we will be fine-tuning the weights together with our first customers. The costs are capped, to make sure our service will remain cheaper than comparable EC2 instances.

Get in contact today, if you are interested.

Strengthen our team as a remote worker (freelancer)

In the past year we’ve been working on more internal projects, and therefore we’re seeking strong GPU coders (good OpenCL experience required) worldwide. This way you can combine staying close to your family with working with advanced technologies. You will be part of the newly formed international team.

Do understand that we have extra requirements for freelancers:

  • You have a personality for working independently.
  • You have your own computer with an OpenCL-capable GPU.
  • You have good internet (for doing remote access).

We offer a job in a well-known OpenCL-company with various interesting projects. You can improve your OpenCL skills and work with various hardware (modern GPUs, embedded processors, FPGAs and more).

Our hiring-procedure is as follows:

  • You send a CV and tell us why you are the perfect candidate.
  • After that you are invited to a longer online test, where you show your skills in C/C++ and algorithms. You will receive a PDF with useful feedback. (3 hours)
  • We send you a GPU assignment. You need to pick the right optimisations, code them and explain your decisions in detail. (Hopefully under 30 minutes)
  • If all goes well, you’ll have a video chat on personal and practical matters. You can also ask us anything, to find out if we fit you. (Around 1 hour)
  • If you and the company are a fit, then you’ll go to the technical round. (About 3 hours)
  • Made it to here? Expect a job-offer.

We’re looking forward to your application.

Apply for a job as OpenCL expert (freelancer) now!

NVIDIA beta-support for OpenCL 2.0 works on Linux too

In the release notes for the 378.66 graphics drivers for Windows (February 2017), NVIDIA officially mentioned OpenCL 2.0 support for the first time. Unfortunately, this is partial support only and, as NVIDIA says, these new [OpenCL 2.0] features are available for evaluation purposes only.

We did our own tests on a GTX 1080 on Windows and could confirm that for Windows the green team is halfway there. NVIDIA still has to implement pipes, enable non-uniform work-group sizes (needed when the global work size of an ND-range is not divisible by the local work size), and fix a few bugs in device-side enqueue.

Today we decided to test NVIDIA’s latest driver (378.13) for 64-bit Linux and check its support for OpenCL 2.0.

NVIDIA, OpenCL 2.0 and Linux

Just like on Windows, our GTX 1080 reports that it is an OpenCL 1.2 device. That is understandable, since support for OpenCL 2.0 is only in beta. The following table gives an overview of the 2.0 features supported by this Linux driver.

  • SVM: yes. Only coarse-grained SVM is supported; fine-grained SVM (an optional feature) is not.
  • Device-side enqueue: partially (surprisingly, it works better than on Windows). Almost all OpenCL programs with a device-side queue that we have tested work. Some advanced examples with multi-level device-side kernel enqueuing and/or CLK_ENQUEUE_FLAGS_WAIT_WORK_GROUP fail. When using a device-side queue, it's only possible to use a 1D nd-range with uniform work-groups (or without specifying a local size); 2D and 3D nd-ranges don't work.
  • Work-group functions: yes.
  • Pipes: no. Pipe functions are defined in libOpenCL.so in the 378.13 drivers, but using them causes run-time errors.
  • Generic address space: yes.
  • Non-uniform work-groups: no.
  • C11 atomics: partially. Using atomic_flag_* functions causes a CL_BUILD_ERROR.
  • Subgroups extension: no.
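
To illustrate the device-side enqueue restriction noted above: the kind of kernel that did work in our tests is a parent that enqueues a child block over a 1D nd-range, leaving the local size unspecified. This is a hedged sketch in OpenCL C 2.0 (build with `-cl-std=CL2.0`), not a verbatim test case from our suite:

```opencl
// Work-item 0 of the parent enqueues a child launch over a 1D nd-range.
// With the 378.13 beta driver, only ndrange_1D with uniform work-groups
// (or, as here, no local size at all) worked; 2D/3D nd-ranges failed.
kernel void parent(global int *data, int n)
{
    if (get_global_id(0) == 0) {
        queue_t q = get_default_queue();
        enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT,
                       ndrange_1D(n),
                       ^{ data[get_global_id(0)] += 1; });
    }
}
```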

The host-side functions clSetKernelExecInfo(), clCreateSamplerWithProperties() and clCreateCommandQueueWithProperties() are also present and working.

As you can see, the support for OpenCL 2.0 on Linux is almost exactly the same as on Windows. But in contrast with the Windows drivers, we were able to successfully compile and run several more kernels that use a device-side queue. This may indicate that the feature is being actively developed, and that in future drivers it will work much better – for both Linux and Windows.

What you can do to make it better

As NVIDIA only adds new functionality to its OpenCL driver when requested, it is very important that they receive these requests. So if you or your employer is a paying customer, do keep requesting the features you need. NVIDIA knows that lacking required functionality is bad for their sales.

SUN jumping on the OpenCL-train?

Edit (27 May 2010): until now Oracle/SUN has not shown anything that would validate this rumour, and the job-posting is not there anymore. Follow us on Twitter to be the first to know if Oracle/SUN will have better support for GPGPU for Solaris and/or Java and/or its hardware.

Job Description: The Power to change your world begins with your work at Sun!
This is a software staff engineering position requiring the ability to design, test, implement and maintain innovative and advanced graphics software. The person in this role is expected to identify areas for improvement and modification of Sun’s platform products and contribute to Sun’s overall product strategy. This person will work closely with others within the team and, as required, across teams to accomplish project objectives. May assume a leadership role in projects, including such activities as leading projects, participating in product planning and technology evaluation and related activities. May use technical leadership and influence to negotiate product design features or applications, both internally and with open source groups as needed.
Requirements: * Excellent problem solving, critical thinking, and communication
skills
* Excellent knowledge of the C/C++ and Java programming languages
* Thorough working knowledge of 3D graphics, GPU architecture, and
3D APIs, such as OpenGL & Direct3D
* Thorough working knowledge of shader-level languages such as
GLSL, HLSL, and/or Cg
* Experience designing cross-platform, public APIs for developers
(Windows/MacOS/Linux)
* Experience with multi-threaded programming and debugging techniques
* Experience with operating systems level engineering
* Experience with performance profiling, analysis, and optimization
Education and Experience: University degree in computer science or engineering plus 5 years direct experience

In other words: a specialist in everything around graphics cards and Java, in a completely new area. Since OpenGL is an already-known area that does not need such a specialist, there is a very good chance the position targets OpenCL. A good choice.

Sun has every reason to jump on the train with Java, since Microsoft is already integrating loads of OpenCL-tools (created by AMD and nVidia) into its Visual Studio product. Java still has more than 3 times the market share of C#, but with this late jump the gap will close in Microsoft’s favour. Remember that C# can easily call C-functions (C being the language OpenCL is written in); Java has a far more difficult task when it comes to calling C-functions without hazards, which is sort of implemented here and here. If Java does not implement OpenCL, then C, C++ and C# will make a hole in Java’s share.

Besides Java, Sun’s super-multi-threaded Sparc-servers will also be in trouble, since the graphics cards of nVidia and AMD are now serious competitors. There is no official OpenCL support for Sparc-processors, while AMD included X86-support and IBM added PowerPC-support (also working on Cell) a few months ago.

Then we have the databases Oracle and MySQL; Oracle depends on Java a lot. While we see experiments speeding up competitor PostgreSQL with GPU-power, Oracle might become the slow turtle in a GPU-ruled database-world. MySQL has the same development-speed and also “bleeding edge” releases, but Oracle might slow down its official support. Expect Microsoft to have SQL Server fully loaded in its next major release.

If Oracle/Sun jumped the train today, expect no OpenCL-products from Oracle/Sun before Q2-2011.

Benchmarks Q1 2011

February is Benchmark Month. The idea is that you do at least one of the following benchmarks and put the results on the Khronos Forum. If you encounter any technical problems, or you think a benchmark favours a certain brand, discuss it below this post. If I missed a benchmark, please put a comment under this post.

Since OpenCL works on all kinds of hardware, we can find out which is the fastest: Intel, AMD or NVIDIA. I don’t think all benchmarks are fit for IBM’s hardware, but I hope to see results of some IBM Cells too. If all goes well, I’ll show the first results of the fastest cards in April. Know that if the numbers are off too much, I might want to see further proof.

Happy benchmarking!

Continue reading “Benchmarks Q1 2011”

Waiting for Mobile OpenCL – Q1 2011

About 5 months ago we started waiting for Mobile OpenCL. Meanwhile we had all the news around ARM at CES in January, and of course all those beta-programs made progress. After a year of having “support“, we actually want to see the words “SDK” and/or “driver“. So who’s leading? Ziilabs, ImTech, Vivante, Qualcomm, FreeScale or newcomer nVIDIA?

Mobile-phone manufacturers could have a big problem with the low-level access to the GPU: while most software can be sandboxed in some form, OpenCL can crash the phone. On the other hand, if a program hasn’t taken down the developer’s test-phone, the chances are low it will take down any other phone. And there are other low-level access-points to the phone as well. So let’s check what has happened until now.

Note: this article will be updated if more news comes from MWC ’11.

OpenCL EP

For mobile devices Khronos has specified a profile, which is optimised for (ARM) phones: OpenCL Embedded Profile. Read on for the main differences (taken from a presentation by Nokia).

Main differences

  • Adapting code for embedded profile
  • Added macro __EMBEDDED_PROFILE__
  • The CL_PLATFORM_PROFILE query returns the string EMBEDDED_PROFILE if only the embedded profile is supported
  • Online compiler is optional
  • No 64-bit integers
  • Reduced requirements for constant buffers, object allocation, constant argument count and local memory
  • Image & floating point support matches OpenGL ES 2.0 texturing
  • The extensions of full profile can be applied to embedded profile

Continue reading “Waiting for Mobile OpenCL – Q1 2011”

Is OpenCL coming to Apple iOS?

Answer: No, or not yet. Apple tested Intel and AMD hardware for OSX, and not portable devices. Sorry for the false rumour; I’ll keep you posted.

Update: It seems that OpenCL is on iOS, but only available to system-libraries and not for apps (directly). That explains part of the responsiveness of the system.

On the thirteenth of August 2011, Apple asked the Khronos Group to test whether 7 unknown devices are conformant with OpenCL 1.1. As Apple uses OpenCL-conformant hardware by AMD, NVidia and Intel in their desktops, the first conclusion is that they have been testing their iOS-devices. A quick look at the list of iOS 5-capable devices gives the following potential candidates:

  • iPhone 3GS
  • iPhone 4
  • iPhone 5
  • iPad
  • iPad 2
  • iPod Touch 4th generation
  • Apple TV

If OpenCL comes to iOS soon (as it is already being tested), iOS 5 would be the moment. The processors of iOS 5 devices are all capable of getting a speed-up from OpenCL, so it is no nonsense-feature. This could speed up many features, among them media-conversion, security-enhancements and manipulation of data-streams. Where now the cloud or the desktop has to be used, in the future it can be done on the device.

Continue reading “Is OpenCL coming to Apple iOS?”