Basic concepts: Function Qualifiers

Optimising one’s thoughts is a complex problem: a lot of interacting processes can be identified, if you think about it.

In OpenCL you have both the run-time and the compile-time of the C-code. It is very important to be precise when you talk about compile-time of the kernel, as this can be confusing: the kernel is compiled at run-time of the software, after the compute-devices have been queried. The OpenCL-compiler can generate better-optimised code when you give it as much information as possible. One of the methods for doing so is using Function Qualifiers. A function qualifier is notated as a kernel-attribute:

__kernel __attribute__((qualifier(qualification)))  void foo ( …. ) { …. }

There are three qualifiers described in OpenCL 1.x. Let’s walk through them one by one. You can also read about them here in the official documentation, with more examples.
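To make this concrete, here are the three qualifiers applied to small example kernels (the kernels themselves are made up for illustration): vec_type_hint tells the compiler which vector type dominates the computations, work_group_size_hint suggests the launch size, and reqd_work_group_size enforces it.

```c
// Hint that the kernel's computations are mostly on float4.
__kernel __attribute__((vec_type_hint(float4)))
void scale(__global float4 *data, float factor)
{
    data[get_global_id(0)] *= factor;
}

// Hint that work-groups of 64x1x1 are the most likely launch configuration.
__kernel __attribute__((work_group_size_hint(64, 1, 1)))
void copy(__global const float *in, __global float *out)
{
    out[get_global_id(0)] = in[get_global_id(0)];
}

// Require work-groups of exactly 64x1x1: the compiler can tailor register
// usage and barriers to that size, but other launch sizes will fail.
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void increment(__global float *data)
{
    data[get_global_id(0)] += 1.0f;
}
```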

Continue reading “Basic concepts: Function Qualifiers”

AMD updates the FirePro S10000 to 12GB and passive cooling

Let the competition on large-memory GPUs begin!

Some algorithms and continuous batch processes will enjoy the extra memory. For example, when inverting a large matrix or running huge simulations, you need as much memory as possible. The extra memory can also be used to avoid memory-bank conflicts by duplicating data-objects (only worthwhile when the data stays in memory long enough to pay back the time it costs to duplicate it).

Another reason for larger memories is double precision computation (this card has a total of 1.48 TFLOPS of it), which doubles the memory-requirements. With accelerators becoming a better fit for HPC (true support for the IEEE-754 double precision storage format, ECC-memory), memory-size becomes one of the limits that needs to be solved.

The alternatives are swapping on GPUs or using multi-core CPUs. Swapping is not an option, as it nullifies all the speed-up. A server with 4 x 16-core CPUs is about as expensive as one accelerator, but uses more energy.

AMD seems to have identified this as an important HPC-market and has therefore just announced the new S10000 with 12GB of memory: to be sent to AMD-partners in January, and on the market in April. Is AMD finally taking the professional HPC market seriously? They now have the first 12GB GPU-accelerator built for servers.

Old vs New

Still a few question-marks, unfortunately


| Functionality | FirePro S10000 6GB | FirePro S10000 12GB |
| --- | --- | --- |
| GPU-processor count | 2 | 2 |
| Architecture | Graphics Core Next | Graphics Core Next |
| Memory per GPU-processor | 3 GB GDDR5 ECC | 6 GB GDDR5 ECC |
| Memory bandwidth per GPU-processor | 240 GB/s | 240 GB/s |
| Performance (single precision, per GPU-processor) | 2.95 TFLOPS | 2.95 TFLOPS |
| Performance (double precision, per GPU-processor) | 0.74 TFLOPS | 0.74 TFLOPS |
| Max power usage for whole dual-GPU card | 325 Watt | 325 Watt (?) |
| Greenness for whole dual-GPU card (SP) | 20.35 GFLOPS/Watt | 18.15 GFLOPS/Watt |
| Bus interface | PCIe 3.0 x16 | PCIe 3.0 x16 |
| Price for whole dual-GPU card | $3500 | ? |
| Price per GFLOPS (SP) | $0.60 | ? |
| Price per GFLOPS (DP) | $2.43 | ? |
| Cooling | Active (!) | Passive |

The biggest differences are the doubling of memory and the passive cooling.

Competitors

The biggest competitor is the Quadro K6000, which I haven’t discussed at all yet. That card throws out 5.2 TFLOPS using one GPU, and can access all 12GB of memory via a 384-bit bus at 288 GB/s (when all cores are used). It is actively cooled, so it’s not really fit for servers (just like the 6GB version of the S10000). The S10000 has a higher total bandwidth, but one GPU can only access its own half of the 12GB at full speed. So the K6000 has the advantage here.
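As a quick sanity-check on those numbers: bandwidth is bus width times effective memory data rate. For the K6000 that is 384 bits / 8 × 6 Gbps = 288 GB/s, while each GPU of the S10000 has its own 384-bit bus at 5 Gbps, giving 384 / 8 × 5 = 240 GB/s per GPU (assuming the usual GDDR5 data rates for these cards).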

Intel is planning 12GB and 16GB Xeon Phis. I’m curious about more benchmarks of the new cards, as the 5110P does not show very good results (benchmark 1, benchmark 2); it compares more to a high-end Xeon CPU than to a GPU. I am more enthusiastic about the OpenCL-performance on their CPUs.

What’s next on this path?

A few questions I asked myself and tried to find answers to.

Extendible memory, like we have for CPUs? Probably not, as GDDR5 is not designed to be upgradable.

Unified memory for multi-GPUs? This would solve the disadvantage of multi-die GPU-cards, as 2, 4 or more GPUs could share the same memory. A reason to watch HSA hUMA‘s progress, which now specifies unified memory access between GPU and CPU.

24GB of memory or more? I’ve included the graph below to give an idea of the costs of GDDR-memory, so it’s an option. These prices of course exclude supplementary parts and the R&D-costs of making more memory accessible to the GPU-cores.

GPU-parts pricing table – Q3 2011

At least we are now going to get an answer to the question: is the market that needs this amount of memory large enough, and thus worth serving?

Is there more need for a wider memory-bus? Remember that GDDR6 is promised for 2014.

What do you think of a 12GB GPU? Do you think this is the path that distinguishes professional GPUs from desktop-GPUs?

Question: do we work with CUDA?

Answer: Yes, actually a lot!

The company was built on OpenCL and we still work with the language a lot – from embedded GPUs and FPGAs to high-end GPUs. Just like OpenCL is unjustly not associated with clusters full of professional GPUs, we were not associated with CUDA. Yet many of our customers have found us to build high performance software in CUDA.

Breaking with the past is not easy, due to associations that seem to stick. With the name change from StreamComputing to Stream HPC some years ago, we wanted to reinforce that break with being “the OpenCL company”. For some time we have been much more pragmatic in solving the problems of our customers, which has resulted in building software in MPI and CUDA – sometimes an unexpected direction, as the customer initially chose OpenCL.

We also started hiring people who only knew CUDA (though we expect them to learn OpenCL), as the right algorithm on the right processor is more important. Internships with CUDA, large CUDA-projects, seeking better relations with Nvidia and more – all have been going on for years. And we like CUDA as much as we like OpenCL – both have unique advantages.

So if you have questions about CUDA, don’t be afraid that you’ll hurt us – we’re happy to help you get fast software.

The Art of Benchmarking

How fast is your software? The simpler the software setup, the easier it is to answer this question. The more complex the software, the more the answer will be “it depends”. Just look at F1-racing: the answer will depend on the driver and the track.

This article focuses on the foundations of solid benchmarking, to help you decide which discussions to have with your team. It is not the full book.

There will be multiple blog posts coming in this series, which will be linked at the end of the post when published.

The questions to ask

Even though it depends on various variables, answers can still be given. These answers are best described as ‘insights’, and that is what this blog is about.

First the commercial message, so we can focus on the main subject. As benchmark-design is not always obvious, we help customers to set up a system that plugs into a continuous integration system and gives continuous insights. More about that in an upcoming blog.

We see benchmarking as providing insights, in contrast with producing a single stopwatch-number. Going back to F1: being second in the race means the team probably wants to know the following:

  • What elements build up the race? From weather conditions to corners, and from other cars on the track to driver-responses
  • How can each of these elements be quantified?
  • How can each of these elements be measured for both own cars and other cars?
  • And, as you guessed from the high-level result (the stopwatch): how much speedup is required, in total and per lap?
Continue reading “The Art of Benchmarking”

Our offices

We’re expanding to more cities, to be closer to talent and to our customers. The idea is to have multiple smaller offices instead of a few big ones. It came from a simple set of questions about how work will look in 2030: the lines between offices will be shifting – not everything will be defined by walls. So smaller offices nearby, with the flexibility to temporarily move to another city, are much better suited to what is expected in 2030.

Each city has one or two senior developer-managers, who take the lead when project-complexity demands it.

In HQ the main structure is provided for onboarding, administration, sales and such. All to make sure the different cities only have a few local things to take care of, so the focus can be on building great software and handling the projects efficiently.

EU – NL – Amsterdam

Koningin Wilhelminaplein 1 – 40601, 1062HG, Amsterdam, Netherlands

Amsterdam is the economic center of the Netherlands, a small country with 17 million inhabitants. It’s the home of HPC-companies like Bright Computing and ClusterVision, and has a large IT workforce that also feeds the R&D demand of large international companies. As the number of companies settling here is still growing, Amsterdam is even planning to build a completely new city district for 40 to 70 thousand people in the harbour area.

There are different sides to the city. When you think of Amsterdam as a tourist, you might think of the Anne Frank House, the Gay Parade, the Van Gogh Museum, the Red Light District, the canals, windmills and tulips. If you consider living here, think instead of the 180 different nationalities that live in the city, the 22 international schools and two universities, the vibrant nightlife and the many-villages-make-the-city atmosphere. Locals of all professions are fluent in English and there is a lively expat community.

You don’t need to live in Amsterdam itself, as there are several cities and villages nearby, each with a unique identity. As the Dutch infrastructure is of a high standard, Amsterdam is easy to reach by train (and car) from many of them. For instance, taking the train from Haarlem to the office takes 9 to 13 minutes; from Leiden or Utrecht, half an hour. Want to live at the sea? Zandvoort to the office is 25 minutes.

Expats (both single and with family) say they found it easy to build up a social life. For Europeans it’s very easy to move to Amsterdam, as there are no real borders in the EU.

EU – HU – Budapest

Radnóti Miklós u. 2, Budapest, 1137, Hungary

Two cities, Buda and Pest, each with their own characteristics, form the 1.75-million-inhabitant capital of Hungary, the ninth-largest city in the EU. The country (est. in 895) has almost 10 million inhabitants.

There is more high-tech industry than you might think. Hungary has one of the highest rates of filed patents, the 6th-highest ratio of high-tech and medium-high-tech output in total industrial output, the 12th-highest research Foreign Direct Investment inflow, ranks 14th in research talent in business enterprise, and has the 17th-best overall innovation efficiency ratio in the world.

If you walk around the city, you’ll find there is no average Hungarian. There is a lot of hidden creativity and a rich beer-culture. There is a unique, quietly vibrant atmosphere that makes you immediately feel at home.

EU – ES – Barcelona

Better weather during winter than in Amsterdam or Budapest, and a vibrant tech-city. It hosts the famous Barcelona Supercomputing Center and is a strong tech-hub.

Contenders

We’re researching multiple cities for starting a new office. Due to Covid this research has been delayed a lot.

  • EU – NL – Utrecht
  • EU – NL – Eindhoven
  • EU – PL – Warsaw
  • EU – FR – Paris
  • EU – FR – Grenoble
  • EU – DE – Heidelberg
  • UK – Bristol

If you live in one of these cities and are good with GPUs, do get in contact. We start with these people:

  • An experienced developer who can manage projects
  • Three to four medior/senior developers
  • A temporary “location starter”
  • Optionally a sales-person

SC14 Workshop Proposals due 7 February 2014

Just to let you know that there should be even more OpenCL and related technologies at SC14.


Are you interested in hosting a workshop at SC14?

Please mark your calendars as SC will be accepting proposals from 1 January – 7 February for independently planned full-, half-, or multi-day workshops.

Workshops complement the overall SC technical program. The goal is to expand the knowledge base of practitioners and researchers in a particular subject area, providing a focused, in-depth venue for presentations, discussion and interaction. Workshop proposals will be peer-reviewed academically, with a focus on submissions that will inspire deep and interactive dialogue on topics of interest to the HPC community.

For more information, please consult: http://sc14.supercomputing.org/program/workshops

Important Submission Info

Web Submissions Open: 1 January 2014
Submission Deadline: 7 February 2014
Notification of acceptance: 28 March 2014

Submit proposals via: https://submissions.supercomputing.org/
Questions: workshops@info.supercomputing.org

We’re thinking of proposing one too. Want to collaborate? Mail us! And don’t forget to go to HiPEAC (20 January) and IWOCL (12 May) to meet us!

Theoretical transfer speeds visualised

There are two overviews I use during my trainings, which I would like to share with you. Normally I draw them on a whiteboard, but having them in digital form has its advantages.

Transfer speeds per bus

The image below gives an idea of theoretical transfer speeds, so you know how a fast network (1GB of data in 10 seconds) compares to GPU-memory (1GB of data in 0.01 seconds). It does not show all the ins and outs, but just gives an idea of how things compare. For instance, it does not show that many cores on a GPU need to work together to reach that maximum transfer rate. Also, I have not used very precise benchmark-methods to arrive at these views.

We zoom in on the slower bus-speeds, so all the good stuff is at the left and all the buses to avoid are on the right. What should be clear is that a read from or write to an SSD will make the software very slow, certainly if you use write-through instead of write-back.

What is important to see is that locality of data makes a big difference. Take a look at the image and then try to follow along. When using GPUs, the following can all increase the speed on the same hardware: keeping hard-disks out of the computation-queue, avoiding transfers to and from the GPU, and increasing the number of computations per byte of data. When an algorithm needs to do a lot of data-operations, such as transposing a matrix, it is better to have a GPU with fast memory-access. When the number of operations dominates, clock-speed and cache-speed are most important.

Continue reading “Theoretical transfer speeds visualised”

Double the performance on AMD Catalyst by tweaking subgroup operations

AMD’s hardware was used at less than half its capacity for scan operations in standard OpenCL 2.0.

OpenCL 2.0 added several new built-in functions that operate on a work-group level. These include functions that work within sub-groups (also known as warps or wavefronts). The work-group functions perform basic parallel patterns for whole work-groups or sub-groups.

The most important ones are the reduce and scan operations. Those patterns have been used in a lot of OpenCL software and can now be implemented in a more straightforward way. The promise to developers was that vendors can now provide better performance using no or very little local memory. However, the promised performance wasn’t there from the beginning.
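To illustrate, this is what those built-ins look like in kernel code – a minimal sketch of our own, to be compiled with -cl-std=CL2.0:

```c
// Per-work-group inclusive prefix sum: one built-in call replaces the
// classic hand-written local-memory scan.
__kernel void prefix_sum(__global const int *in, __global int *out)
{
    size_t gid = get_global_id(0);
    out[gid] = work_group_scan_inclusive_add(in[gid]);
}

// Per-work-group reduction: every work-item receives the group's total,
// so only work-item 0 writes it out.
__kernel void block_sum(__global const float *in, __global float *totals)
{
    float sum = work_group_reduce_add(in[get_global_id(0)]);
    if (get_local_id(0) == 0)
        totals[get_group_id(0)] = sum;
}
```

The sub-group variants (sub_group_reduce_add and friends, via the cl_khr_subgroups extension) have the same shape.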

Recently at StreamHPC we worked on improving the performance of certain OpenCL kernels running specifically on AMD GPUs, where we needed OpenGL-interop and thus chose the Catalyst-drivers. It turned out that the work-group and sub-group functions did not give the expected performance on either Windows or Linux. Continue reading “Double the performance on AMD Catalyst by tweaking subgroup operations”

Faster Development Cycles for FPGAs

The time-difference between the normal and the OpenCL flow is large. The final product is as fast and efficient.

VHDL and Verilog are not the right tools when it comes to developing for FPGAs quickly.

  • It is time-consuming. If the first cycle takes 3 months, then each subsequent cycle easily takes 2 weeks. Time is money.
  • Porting or upgrading a design from one FPGA device to another is also time-consuming. This makes it essential to choose the final FPGA vendor and family upfront.
  • Dual-platform development on CPU and FPGA needs synchronisation: the code works on either the CPU or the FPGA, which makes the functional tests written for the CPU-version less trustworthy.

Here is where OpenCL comes in.

  • Shorter development cycles. Programming in OpenCL is normally much faster than in VHDL or Verilog. If you are porting C/C++ code onto an FPGA, the development cycles will be dramatically shorter – think weeks instead of months, as this news article explains. This means a radically reduced investment, as well as time for architectural exploration.
  • OpenCL works on both CPUs and FPGAs, so functional tests can be run on either. As a bonus the code can be optimised for GPUs, within a short time-frame.
  • The performance is equal to VHDL and Verilog, unless FPGA-specific optimisations are used, such as vector-widths not equal to a power of two.
  • Vendor Agnostic solution. Porting to other FPGAs takes considerably less time and the compiler solves this problem for you.
  • Both Xilinx and Altera have OpenCL compilers. Altera was the first to come out with an OpenCL offering and has a full SDK, which is an add-on to Quartus II. Xilinx also has a stand-alone OpenCL development environment called SDAccel.

Support for OpenCL is strong at both Altera and Xilinx

Both vendors suggest OpenCL to overcome existing FPGA design problems. Altera suggests using OpenCL to speed up the process for existing developers. So OpenCL is not a third-party tool you need to trust separately.

OpenCL allows a user to abstract away the traditional hardware FPGA development flow for a much faster and higher level software development flow – Altera

Xilinx suggests that OpenCL can enable companies without the needed developer resources to start working with FPGAs.

Teams with limited or no FPGA hardware resources, however, have found the transition to FPGAs challenging due to the RTL (VHDL or Verilog) development expertise needed to take full advantage of these devices. OpenCL eases this programming burden – Xilinx

Why choose StreamHPC?

There are several reasons to let us do the porting and prototyping of your product.

  • We have the right background, as our team consists of CPU, GPU and FPGA developers. Our code is therefore designed with easy porting in mind.
  • Our costs are lower than having the product done in Verilog/VHDL.
  • We give guarantees and support for our products on all platforms the product is ported on.
  • We can port the final OpenCL code to Verilog/VHDL, keeping the same performance. In case you don’t trust a high-level language, we have you covered.
  • Optionally you can get both the code and a technical report with a detailed explanation of how we did it. So you can learn from this and modify the code yourself.
  • You get free advice on when (and not) to use OpenCL for FPGAs.

There are three ways to get in contact quickly:

Call: +31 854865760 (European office hours)

E-mail: contact@streamhpc.com

Fill in this form – mention when you want to be called back (possible outside normal office hours):

[contact_form]

Want to read more?

We wrote about OpenCL-on-FPGAs on our blog in the previous years.

Promotion for OpenCL Training (’12 Q4 – ’13 Q2)

So you want your software to be much faster than the competition?

In 4 days your software team learns all techniques to make extremely fast software.

Your team will learn how to write optimal code for GPUs and make better use of the existing hardware. They will be able to write faster code immediately after the training – a doubling of speed is the minimum, 100 times is possible. Your customers will notice the difference in speed.

We use advanced, popular techniques like OpenCL and older techniques like cache-flow optimisation. At the end of the training you’ll receive a certificate from StreamHPC.

Want more information? Contact us.

About the training

Location and Time

OpenCL is a rather new subject, and hard-coding the location and time has not proven successful for trainers in this subject in past years. Therefore we chose flexible dates, and initially offer the training in large/capital cities and technology centres world-wide.

A final date for a city will be picked once there are 5 to 8 attendees, with a maximum of 12. You can specify your preferences for cities and dates in the form below.

Some discounts are available for developing countries.

Agenda

Day 1: Introduction

Learn about GPU architectures and AVX/SSE, how to program them and why it is faster.

  • Introduction to parallel programming and GPU-programming
  • An overview of parallel architectures
  • The OpenCL model: host-programming and kernel-programming
  • Comparison with NVIDIA’s CUDA and Intel’s Array Building Blocks.
  • Data-parallel and task-parallel programming
Lab-session will be an image-filter.
Note: since CUDA is very similar to OpenCL, you are free to choose to do the lab-sessions in CUDA.

Day 2: Tools and advanced subjects

Learn about parallel-programming tactics, host-programming (transferring data), IDEs and tools.

  • Static kernel analysis
  • Profiling
  • Debugging
  • Data handling and preparation
  • Theoretical backgrounds for faster code
  • Cache flow optimisation
Lab-session: yesterday’s image-filters using a video-stream from a web-cam or file.

Day 3: Optimisation of memory and group-sizes

Learn the concept of “data-transport is expensive, computations are cheap” (a small code sketch follows after this day’s programme).
  • Register usage
  • Data-rearrangement
  • Local and private memory
  • Image/texture memory
  • Bank-conflicts
  • Coalescence
  • Prefetching
Lab-session: various small puzzles, which can be solved using the explained techniques.
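The sketch below (our own example, not official course material) shows how several of these topics combine: a tiled matrix transpose with coalesced reads and writes, staging through local memory, and padding against bank-conflicts.

```c
// Tiled matrix transpose. Both the read and the write are coalesced because
// each work-group stages a 16x16 tile in local memory; the +1 padding on the
// tile rows avoids local-memory bank-conflicts.
#define TILE 16

__kernel void transpose(__global const float *in, __global float *out,
                        int width, int height)
{
    __local float tile[TILE][TILE + 1];

    int x = get_group_id(0) * TILE + get_local_id(0);
    int y = get_group_id(1) * TILE + get_local_id(1);
    if (x < width && y < height)
        tile[get_local_id(1)][get_local_id(0)] = in[y * width + x];

    barrier(CLK_LOCAL_MEM_FENCE);

    // Swap the tile coordinates for the write-out.
    x = get_group_id(1) * TILE + get_local_id(0);
    y = get_group_id(0) * TILE + get_local_id(1);
    if (x < height && y < width)
        out[y * height + x] = tile[get_local_id(0)][get_local_id(1)];
}
```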

Day 4: Optimisation of algorithms

Learn techniques to help the compiler make better and faster code.
  • Precision tinkering
  • Vectorisation
  • Manual loop-unrolling
  • Unbranching
Lab-session: like day 3, but now with compute-oriented problems.

Enrolment

When filling in this form, you declare that you intend to follow the course. Cancellation can be done via e-mail or phone at any time.

StreamHPC will keep you up-to-date about the training at your location(s). When the minimum of 5 attendees has been reached, a final date will be discussed. If you selected more than one location, you have the option to wait for a training in another city.

Put any remarks you have in the message. If you have any question, mail to trainings@streamhpc.com.

[si-contact-form form=’7′]

Academic hackathons for Nvidia GPUs

Are you working with Nvidia GPUs in your research, and do you wish Nvidia would support you as they used to 5 years ago? This is now done with hackathons, where you get one full week of support to get your GPU-code improved and your CPU-code ported. You still have to do it yourself, so it’s not comparable to the services we provide.

To start, get your team to decide to do this. It takes preparation and a clear formulation of what your goals are.

When and where?

It’s already April, so some hackathons have already taken place. For 2019 these remain, where you can work in any language, from OpenMP to OpenCL and from OpenACC to CUDA. Python + CUDA-libraries is also no problem, as long as the focus is on Nvidia.

Continue reading “Academic hackatons for Nvidia GPUs”

Master+PhD students, applications for two PRACE summer activities open now

PRACE is organising two summer activities for Master+PhD students. Both activities are expense-paid programmes and will allow participants to travel and stay at a hosting location and learn about HPC:

  • The 2017 International Summer School on HPC Challenges in Computational Sciences
  • The PRACE Summer of HPC 2017 programme

The main objective of this programme is to enable HiPEAC member companies in Europe to have access to highly skilled and exceptionally motivated research talent. In turn, it offers PhD students from Europe a unique opportunity to experience the industrial research environment and to work on R&D projects solving real problems.

Both programmes are explained in detail below. Continue reading “Master+PhD students, applications for two PRACE summer activities open now”

Learn about AMD’s PRNG library we developed: rocRAND – includes benchmarks

When CUDA kept its dominance over OpenCL, AMD introduced HIP – a programming interface that closely resembles CUDA. Porting code to AMD hardware no longer takes months: more and more CUDA-software converts to HIP without problems. Even really large and complex code-bases take a few weeks at most, and we found that the problems solved along the way also made the original CUDA-code run faster.
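To show how close the two are, here is a hedged SAXPY sketch of our own (not taken from a customer project); the CUDA equivalents are noted in the comments:

```cpp
#include <hip/hip_runtime.h>

// The kernel is unchanged compared to CUDA - only the host API differs.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void run_saxpy(int n, float a, const float *x_h, float *y_h)
{
    float *x_d, *y_d;
    hipMalloc(&x_d, n * sizeof(float));                     // cudaMalloc
    hipMalloc(&y_d, n * sizeof(float));
    hipMemcpy(x_d, x_h, n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(y_d, y_h, n * sizeof(float), hipMemcpyHostToDevice);
    hipLaunchKernelGGL(saxpy, dim3((n + 255) / 256), dim3(256), 0, 0,
                       n, a, x_d, y_d);                     // saxpy<<<...>>>
    hipMemcpy(y_h, y_d, n * sizeof(float), hipMemcpyDeviceToHost);
    hipFree(x_d);                                           // cudaFree
    hipFree(y_d);
}
```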

The only problem is that CUDA-libraries need to have their HIP-equivalent to be able to port all CUDA-software.

Here is where we come in. We helped AMD make a high-performance Pseudo Random Number Generator (PRNG) library, called rocRAND. Random number generation is important in many fields, from finance (Monte Carlo simulations) to cryptography, and from procedural generation in games to providing white noise. For some applications it’s enough to have just some data, but for large simulations the PRNG is the limiting factor.
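To give a taste of the host-side API – a minimal sketch, assuming the rocRAND C API as we remember it (check the official documentation for the exact header path and enum names):

```cpp
#include <hip/hip_runtime.h>
#include <rocrand/rocrand.h>

int main()
{
    const size_t n = 1 << 20;
    float *d_values;
    hipMalloc(&d_values, n * sizeof(float));    // output lives on the device

    rocrand_generator gen;
    // Philox is one of several generator types; XORWOW, MRG32k3a and more
    // are available as well.
    rocrand_create_generator(&gen, ROCRAND_RNG_PSEUDO_PHILOX4_32_10);
    rocrand_set_seed(gen, 1234ULL);
    rocrand_generate_uniform(gen, d_values, n); // n uniform floats in (0,1]
    rocrand_destroy_generator(gen);

    hipFree(d_values);
    return 0;
}
```

Continue reading “Learn about AMD’s PRNG library we developed: rocRAND – includes benchmarks”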

Is OpenCL coming to Apple iOS?

Answer: No, or not yet. Apple tested Intel and AMD hardware for OSX, and not portable devices. Sorry for the false rumour; I’ll keep you posted.

Update: It seems that OpenCL is on iOS, but only available to system-libraries and not for apps (directly). That explains part of the responsiveness of the system.

On the thirteenth of August 2011, Apple asked the Khronos Group to test whether 7 unknown devices are conformant with OpenCL 1.1. As Apple uses OpenCL-conformant hardware by AMD, NVidia and Intel in their desktops, the first conclusion is that they have been testing their iOS-devices. A quick look at the list of iOS 5 capable devices gives the following potential candidates:

  • iPhone 3GS
  • iPhone 4
  • iPhone 5
  • iPad
  • iPad 2
  • iPod Touch 4th generation
  • Apple TV
If OpenCL comes to iOS soon (as it has already been tested), iOS 5 would be the moment. The processors of iOS 5 devices are all capable of getting a speed-up by using OpenCL, so it is no nonsense-feature. This could speed up many features, among which media-conversion, security-enhancements and manipulation of data-streams. Where now the cloud or the desktop has to be used, in the future it can be done on the device.

Continue reading “Is OpenCL coming to Apple iOS?”

OpenCL in the cloud – API beta launching in a month

We’re starting the beta phase of our AMD FirePro based OpenCL cloud services in about a month, to test our API. If you need to have your OpenCL based service online and don’t want to pay hundreds to thousands of euros for GPU-hosting, then this is what you need. We have room for a few more.

The instances are chrooted, not virtualised. The API-calls are protected and potentially some extra calls have to be made to fully lock the GPU to your service. The connection is 100MBit duplex.

Payment is per usage, per second, per GPU and per MB of data – we will be fine-tuning the weights together with our first customers. The costs are capped, to make sure our service will remain cheaper than comparable EC2 instances.

Get in contact today, if you are interested.

Internships: the self-driving vehicle – updated

UPDATE: We now only offer thesis support (“externs”) for students who want to use OpenCL in their research but don’t have such support at their university. For the rest, the below applies.

From July there are several internships available here at StreamHPC, all around self-driving vehicles (or even self-flying drones). This means that with an interest in AI, embedded programming and sensors, you’re all set.

You can work as an intern for a period of 1 to 6 months, and combine it with your thesis. We will assist you with planning, thesis corrections and technical support (especially OpenCL). There are also a few other startups in the building whom you might like to talk to.

Your time will consist of literature study, programming, testing, OpenCL-optimisations and playing. We’ll work with bikes and toy-cars, so no big cars that are expensive to crash. Study fields are road-location, obstacles, driving-style detection, etc.

If you want to do an internship purely to gain experience, we can offer you a combination of research and working for real customers.

Some targets:

  • Create a small test-car full with sensors:
    • radar for distance
    • multi cameras
    • laser
    • other sensors, like touch
  • Programming an embedded board with OpenCL-capability.
  • Programming pointcloud algorithms in OpenCL.
  • Defining the location on the road, also in OpenCL. (taken)
  • Detecting pedestrians, signs.
  • Have fun creating this.

Please contact us and tell your ideas and plan.

Benchmarks Q1 2011

February is Benchmark Month! The idea is that you run at least one of the following benchmarks and put the results on the Khronos Forum. If you encounter any technical problems, or you think a benchmark favours a certain brand, discuss it below this post. If I missed a benchmark, please put a comment under this post too.

Since OpenCL works on all kinds of hardware, we can find out which is the fastest: Intel, AMD or NVIDIA. I don’t think all benchmarks are fit for IBM’s hardware, but I hope to see results of some IBM Cells too. If all goes well, I’ll show the first results of the fastest cards in April. Know that if the numbers are off too much, I might want to see further proof.

Happy benchmarking!

Continue reading “Benchmarks Q1 2011”

What is Performance Engineering?

Software Performance Engineering is increasing the throughput and speed of software by making better use of the hardware’s possibilities. It uses faster algorithms and applies less data-intensive programming-concepts.

With new software, performance-requirements can be specified beforehand. This can be supported by specifying benchmarks in the test-cases.

Performance engineering in post-production happens more often, as requirements outgrow the ones originally defined. It contains the following phases:


  1. Reverse engineering the code and compare with the original requirement-documents
  2. Measuring the code performance to find bottle-necks.
  3. Redesigning the code such that it supports current requirements.
  4. Implementing optimizations.


Most times performance engineering is needed for software that is less than 3 years old, or when the code was not designed or managed well. Reverse-engineering the code can reduce the time needed to rebuild the software.

StreamHPC is famous for porting software to accelerators like GPUs. But performance engineering is much more than that, as accelerators can only be used when the original code meets minimum quality standards – only then can performance be increased.

Black-Scholes mixing on SandyBridge, Radeon and Geforce

Intel, AMD and NVidia have all written implementations of the Black-Scholes algorithm for their devices. Intel has described a kernel in their OpenCL optimisation-document (page 28 and further) with 3 random factors as input: S, K and T, and two configuration-constants, R and V. NVidia’s is easy to compare to Intel’s, while AMD chose to write down the algorithm quite differently.
So we have three different but comparable kernels in total. What will happen if we run these, each optimised for a specific type of hardware, on the following devices?

  • Intel(R) Core(TM) i7-2600 CPU @3.4GHz, Mem @1333MHz
  • GeForce GTX 560 @810MHz, Mem @1000MHz
  • Radeon HD 6870 @930MHz, Mem @1030MHz

Three different architectures and three different drivers. To complete the comparison, I will also try to see if there is a difference when using Intel’s versus AMD’s driver for the CPU.
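For readers new to the algorithm, a generic, unoptimised OpenCL formulation looks roughly like the sketch below – our own version with S, K and T as input buffers and R and V as constants, not one of the three vendor kernels:

```c
// Black-Scholes for European options. S = spot price, K = strike,
// T = time to expiry; R (riskless rate) and V (volatility) are constants.
#define R 0.02f
#define V 0.30f

// Cumulative normal distribution via the erf() built-in.
inline float cnd(float d)
{
    return 0.5f * (1.0f + erf(d * M_SQRT1_2_F));
}

__kernel void black_scholes(__global const float *S, __global const float *K,
                            __global const float *T,
                            __global float *call, __global float *put)
{
    size_t i = get_global_id(0);
    float sqrtT = sqrt(T[i]);
    float d1 = (log(S[i] / K[i]) + (R + 0.5f * V * V) * T[i]) / (V * sqrtT);
    float d2 = d1 - V * sqrtT;
    float discounted = K[i] * exp(-R * T[i]);
    call[i] = S[i] * cnd(d1) - discounted * cnd(d2);
    put[i]  = discounted * cnd(-d2) - S[i] * cnd(-d1);
}
```

Continue reading “Black-Scholes mixing on SandyBridge, Radeon and Geforce”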

Reducing downtime with OpenCL… Ever thought of that?

Something that creates extra value for OpenCL is the flexibility with which it runs on a wide variety of hardware. A famous strategy is running the code on CPUs to find data-races and debug the code more easily. Another is to develop on GPUs and port to FPGAs to reduce the development-cycles.

But there’s one, quite important, often forgotten: replacement of faulty hardware. You can blame the supplier, or even Murphy if you want, but what is almost certain is that there’s a high chance of facing downtime precisely when the hardware cannot be replaced right away.

Failing to plan is planning to fail

To limit downtime, there are a few options:

  • Have a good SLA in place for 24/7 hardware-replacement.
  • Have spare-hardware in stock.
  • Have over-capacity on your compute-servers.

But the problem is that all three are expensive in some form if you’re not flexible enough. If you use professional accelerators like the Intel Xeon Phi, NVidia Tesla or AMD FirePro, you risk unexpected stock shortages at your supplier.

With OpenCL the hardware can be replaced by any accelerator, whereas with vendor-specific solutions this is not possible.

Flexibility by OpenCL

I’d like to share one example of how to introduce flexibility into your hardware-management, but there are various others, more tailored to your requirements.

To detect faulty hardware, think of a server with three GPUs where selected jobs are run by all three – any hardware-problem will be detected and pin-pointed. Administrating which hardware has done which job completes the mechanism. Exactly this mechanism can be used to replace faulty hardware with any accelerator: let the replacement-accelerator run the same jobs as the other two, as an acceptance-test.

If you need your software to be optimised for several accelerators, you’re in the right place. We can help you with both machine and hand optimizations. That’s a plan that cannot fail!

Waiting for Mobile OpenCL – Q1 2011

About 5 months ago we started waiting for Mobile OpenCL. Meanwhile we got all the news around ARM at CES in January, and of course all those beta-programmes made progress. After a year of having “support”, we actually want to see the words “SDK” and/or “driver”. So who’s leading? Ziilabs, ImTech, Vivante, Qualcomm, FreeScale or newcomer nVIDIA?

Mobile phone manufacturers could have a big problem with the low-level access to the GPU. While most software can be sandboxed in some form, OpenCL can crash the phone. But on the other hand, if the program hasn’t taken down the developer’s test-phone, chances are low it will take down any other phone. And there are more low-level access-points to the phone anyway. So let’s check what has happened until now.

Note: this article will be updated if more news comes from MWC ’11.

OpenCL EP

For mobile devices Khronos has specified a profile, which is optimised for (ARM) phones: OpenCL Embedded Profile. Read on for the main differences (taken from a presentation by Nokia).

Main differences

  • Adapting code for embedded profile
  • Added macro __EMBEDDED_PROFILE__
  • CL_PLATFORM_PROFILE capability returns the string EMBEDDED_PROFILE if only the embedded profile is supported (see the sketch after this list)
  • Online compiler is optional
  • No 64-bit integers
  • Reduced requirements for constant buffers, object allocation, constant argument count and local memory
  • Image & floating point support matches OpenGL ES 2.0 texturing
  • The extensions of full profile can be applied to embedded profile
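Checking at run-time which profile you got is a small query against the platform; a minimal host-side sketch (error handling omitted):

```c
#include <stdio.h>
#include <string.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    char profile[64];

    clGetPlatformIDs(1, &platform, NULL);
    // Returns "FULL_PROFILE" or "EMBEDDED_PROFILE".
    clGetPlatformInfo(platform, CL_PLATFORM_PROFILE,
                      sizeof(profile), profile, NULL);
    printf("Profile: %s\n", profile);
    return strcmp(profile, "EMBEDDED_PROFILE") == 0;
}
```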

Continue reading “Waiting for Mobile OpenCL – Q1 2011”