AMD’s answer to NVIDIA TESLA K10: the FirePro S9000

Recently AMD announced their new FirePro GPUs for use in servers: the S9000 (shown at the right) and the S7000. They use passive cooling, as server racks are actively cooled already. AMD's server partners will have products ready in Q1 2013 or even before; SuperMicro, Dell and HP will probably be among the first.

What does this mean? We finally get a very good alternative to TESLA: servers with probably 2 (1U) or 4+ (3U) FirePro GPUs, giving 6.46 up to 12.92 TFLOPS or more of theoretical extra performance on top of the available CPUs. At StreamHPC we are happy with that, as AMD is a strong OpenCL supporter and FirePro GPUs give much more performance than TESLAs. It also outperforms the unreleased Intel Xeon Phi in single precision and is close in double precision.

Edit: About the multi-GPU configuration

A multi-GPU card has various advantages, as it uses less power and space, but it does not compare directly to a single GPU. Since the communication still goes via the PCIe bus, the compute capabilities of two single-GPU cards and one multi-GPU card are not that different. Compute problems are most times memory-bound, and that is an important factor in why GPUs outperform CPUs: their very high memory bandwidth. Therefore I put a lot of weight on the memory and cache available per GPU and per core.

Continue reading “AMD’s answer to NVIDIA TESLA K10: the FirePro S9000”

AMD positions FirePro S10000 against both TESLA K10 (and K20)

During the “little” HPC show, SC12, several vendors launched some very impressive products. Question is: who steals the show from whom? Intel finally got their Phi processor launched, NVIDIA came with the TESLA K20 plus K20X, and AMD introduced the FirePro S10000.

This card is the fastest card out there with 5.91 TFLOPS of processing power – much faster than the TESLA K20X, which only does 3.95 TFLOPS. But comparing a dual-GPU to a single-GPU card is not always fair. The moment you choose to have more than one GPU (several GPUs in one case or a small cluster), the S10000 can be fully compared to the Tesla K20 and K20X.

The S10000 can be seen as a dual-GPU version of the S9000, but it does not fully add up. Most obvious are the big difference in power usage (325 Watt) and the active cooling. As server cases are made for 225 Watt of cooling power, this is seen as a potential disadvantage. But AMD has clearly looked around: for GPUs, not 1U cases are used, but 3U servers that use the full width to stack several GPUs.

Continue reading “AMD positions FirePro S10000 against both TESLA K10 (and K20)”

Khronos Invites Press & Game Developers to Sessions @ GDC San Francisco


Khronos just sent out the below message to press and game developers. To my understanding, there are many game devs among the readers of this blog, so I'd like to share the message with you.

JOIN KHRONOS GROUP AT GDC 2014 SAN FRANCISCO
Press Conference, Technology Sessions and Refreshment Oasis

We invite you to attend one or more of the Khronos sessions taking place in the Khronos meeting room just off the Moscone show floor. For detailed information on each session, and to register, please visit: https://www.khronos.org/news/events/march-meetup-2014.
PRESS CONFERENCE

  • WHEN: Wednesday March 19 at 10:00 AM (Reception 9:30 AM)
  • WHERE: Room 262, West Mezzanine Level, (behind Official Press Room)
  • GUESTS: Members of the Press and Industry by Invitation*
  • RSVP: Jon Hirshon, Horizon PR jh@horizonpr.com

Members of the press are invited to attend the Khronos Press Conference, held jointly again this year with the PCGA (PC Gaming Alliance) consortium. Khronos will issue significant news on OpenGL ES, WebCL, OpenCL and several more Khronos technologies, and the PCGA will issue news about 2013 gaming market numbers. Updates will be delivered by Khronos and PCGA executives, with insights from David Cole of DFC and Jon Peddie of Jon Peddie Research.

DEVELOPER SESSIONS

All GDC attendees** are invited to the Khronos Developer Sessions where experts from the Khronos Working Groups will deliver in-depth updates on the latest developments in graphics and media processing. These sessions are packed with information and provide a great opportunity to:

  • Hear about the latest updates from the gurus who invented these technologies
  • See leading-edge demos & applications
  • Put your questions to members of the Khronos working groups
  • Meet with other community members

SESSION SCHEDULE

Wednesday March 19

  • 3:00 – 4:00 : OpenCL & SPIR
  • 4:00 – 5:00 : OpenVX, Camera and StreamInput
  • 5:00 – 6:00 : OpenGL ES
  • 6:00 – 7:00 : OpenGL

Thursday March 20

  • 3:00 – 3:50 : WebCL
  • 4:00 – 4:50 : Collada and glTF
  • 5:00 – 7:00 : WebGL

SESSION REGISTRATION
For information and to register, visit: https://www.khronos.org/news/events/march-meetup-2014

REFRESHMENT OASIS

We thought “Refreshment Oasis” sounded like a nice way to say “sit down and have a cup of coffee while we keep working!” Khronos is happy to offer a hospitality suite conveniently located next to our primary meeting room (and the official GDC Press room) to showcase Khronos Member technology demos and offer a place for GDC guests, Khronos Members and Marketing staff to meet. You are welcome to just drop by for a chat, or please email Michelle@GoldStandardGroup.org to arrange a meeting with any Work Group Chairs, Khronos Execs or Marketing Team.

We look forward to seeing you at the show!

*Admittance to the Press Conference is open to all GDC registered Press, and to members of industry on a “Seating Available” basis.  Space is limited so reserve your seat today.

** Admittance to the KHRONOS sessions is FREE but: (1) all attendees must have a GDC Exhibitor or Conference Pass to gain entry to the Khronos meeting room area (GDC tickets details http://www.gdconf.com) and (2) all attendees MUST REGISTER for the individual Khronos API sessions. We expect demand to be high and space is limited.

With open standards becoming more important in the very diverse computer-game industry, Khronos is also growing. If you are in this industry and want to know (or influence) the landscape for the coming years, you should attend.

AMD OpenCL coding competition

The AMD OpenCL coding competition seems to be Windows 7 64-bit only. So if you are on another version of Windows, OSX or (like me) on Linux, you are left behind. Of course StreamHPC supports software that just works anywhere (seriously, how hard is that nowadays?), so here are the instructions for how to enter the competition when you work with Eclipse CDT. Why it only works with 64-bit Windows I don't really get (but I understood it was a hint).

I focused on Linux, so it might not work with Windows XP or OSX right away. With a little hacking, I'm sure you can change the instructions to work with e.g. Xcode or any other IDE that can import C++ projects with makefiles. Let me know if it works for you and what you changed.

Continue reading “AMD OpenCL coding competition”

Overview of OpenCL 2.0 hardware support, samples, blogs and drivers

We were too busy lately to tell you about it: OpenCL 2.0 is getting ready for prime time! As it makes use of more recent hardware features, it's more powerful than OpenCL 1.x could ever be.

To get you up to speed, see this list of new OpenCL 2.0 features:

  • Shared Virtual Memory: host and device kernels can directly share complex, pointer-containing data structures such as trees and linked lists, providing significant programming flexibility and eliminating costly data transfers between host and devices.
  • Dynamic Parallelism: device kernels can enqueue kernels to the same device with no host interaction, enabling flexible work scheduling paradigms and avoiding the need to transfer execution control and data between the device and host, often significantly offloading host processor bottlenecks.
  • Generic Address Space: functions can be written without specifying a named address space for arguments, especially useful for those arguments that are declared to be a pointer to a type, eliminating the need for multiple functions to be written for each named address space used in an application.
  • Improved image support: including sRGB images and 3D image writes, the ability for kernels to read from and write to the same image, and the creation of OpenCL images from a mip-mapped or multi-sampled OpenGL texture for improved OpenGL interop.
  • C11 Atomics: a subset of C11 atomics and synchronization operations to enable assignments in one work-item to be visible to other work-items in a work-group, across work-groups executing on a device or for sharing data between the OpenCL device and host.
  • Pipes: memory objects that store data organized as a FIFO and OpenCL 2.0 provides built-in functions for kernels to read from or write to a pipe, providing straightforward programming of pipe data structures that can be highly optimized by OpenCL implementers.
  • Android Installable Client Driver Extension: enables OpenCL implementations to be discovered and loaded as a shared object on Android systems.

I could write many articles about the above subjects, but leave that for later. This article won’t get into these technical details, but more into what’s available from the vendors. So let’s see what toys we were given!

A note: don’t start with OpenCL 2.0 directly, if you don’t know the basic concepts of OpenCL. Continue reading “Overview of OpenCL 2.0 hardware support, samples, blogs and drivers”

OpenCL tutorial videos from Mac Research

A while ago macresearch.com ceased to exist, as David Gohara pulled the plug. Luckily the sources of a very nice tutorial were not lost, and David gave us permission to share his material.

Even if you don't have a Mac, these almost five-year-old materials are very helpful for understanding the basics (and more) of OpenCL.

We also have the sources (chapter 4, chapter 6) and the collection of corresponding PDFs for you. All material is copyright David Gohara. If you like his style, also check out his podcasts.

Introduction to OpenCL

http://www.youtube.com/watch?v=oc1-y1V1TPQ

OpenCL fundamentals

http://www.youtube.com/watch?v=FrLqSgYyLQI

Building an OpenCL Project

http://www.youtube.com/watch?v=K7QiD74kMvU

Memory layout and Access

http://www.youtube.com/watch?v=oPE3ypaIEv4

Questions and Answers

http://www.youtube.com/watch?v=9rA6DypMsCU

Shared Memory Kernel Optimisation

http://www.youtube.com/watch?v=oFMPWuMso3Y

Did you like it? Do you have improvements on the code? Want us to share more material? Let us know in the comments, or contact us directly.

Want to learn more? Look in our knowledge base, or follow one of our trainings.


SC15 news from Monday

Warning: below is raw material and needs some editing.

Today there was quite some news around OpenCL – I'm afraid I can't wait till later to have it all covered. Some news is unexpected, some is great. Let's start with the great news, as the unexpected news needs some discussion.

Khronos released OpenCL 2.1 final specs

As of today you can download the header files and specs from https://www.khronos.org/opencl/. The biggest changes are:

  • C++ kernels (still separate source files, which is to be tackled by SYCL)
  • Subgroups are now a core functionality. This enables finer grain control of hardware threading.
  • New function clCloneKernel enables copying of kernel objects and state for safe implementation of copy constructors in wrapper classes. Hear all Java and .NET folks cheer?
  • Low-latency device timer queries for alignment of profiling data between device and host code.

OpenCL 2.1 will be supported by AMD. Intel was very loud with support when the provisional specs got released, but gave no comments today. Other vendors did not release an official statement.

Khronos released SPIR-V 1.0 final specs

SPIR-V 1.0 can represent the full capabilities of OpenCL 2.1 kernels.

This is very important! OpenCL is no longer the only language that is seen as input for GPU compilers. Neither is OpenCL host code the only API that can handle compute shaders, as Vulkan can do this too. Lots of details still have to be seen, as not all SPIR-V compilers will have full support for all OpenCL-related commands.

With the specs the following tools have been released:

SPIR-V will make many frontends possible, giving co-processor powers to every programming language that exists. I will blog more about SPIR-V's possibilities in the coming year.

Intel claims OpenMP is up to 10x faster than OpenCL

The below image appeared on Twitter, claiming that OpenMP is much faster than OpenCL. Some discussion later, we could conclude they compared apples and oranges. We're happy to peer-review the results, putting the claims in a full perspective where MKL and the operation mode are mentioned. Unfortunately they did not react, as <sarcasm>we will be very happy to admit that for the first time in history a directive language is faster than an explicit language – finally we have magic!</sarcasm>

https://twitter.com/StreamHPC/status/666178711549583364

Left half is FFT- and GEMM-based, probably using Intel's MKL. All algorithms seem to be run in a different mode (native mode) when using OpenMP, for which Intel did not provide OpenCL driver support.

Later this week we'll get back to Intel and their upcoming Xeon+FPGA chip, and whether OpenCL is the best language for that job. It of course is possible that they try to run OpenMP on the FPGA, but that would be a big surprise. Truth is that Intel doesn't like this powerful open standard intruding on the HPC market, where they have a monopoly.

AMD claims OpenCL is hardly used in HPC

Well, this is one of those claims that they did not really think through. OpenCL is used in HPC quite a lot, but mostly on NVidia hardware. Why not just CUDA there? Well, there is demand for OpenCL for several reasons:

  • Avoid vendor lock-in.
  • Making code work on more hardware.
  • General interest in co-processors, not specific one brand.
  • Initial code is being developed on different hardware.

Thing is that NVidia did a superb job in getting their processors into supercomputers and clouds. So OpenCL is mostly run on NVidia hardware, which is the biggest reason why that company is so successful in slowing the advancement of the standard by rolling out upgrades four years late. Even though I tried to get the story out, NVidia is not eager to tell the beautiful love story between OpenCL and the NVidia co-processor, as the latter has CUDA as its wife.

HPC sites with Intel Xeon Phi also give OpenCL some love. Same here: Intel prefers to talk about their OpenMP instead of OpenCL.

AMD is in few HPC sites, and indeed that is where OpenCL is used.

No, we’re not happy that AMD tells such things, only to promote its own new languages.

CUDA goes AMD and open source

AMD now supports CUDA! The details: they have made a tool that can compile CUDA to "HiP" – HiP is a new language with few details available at the moment. Yes, I have the same questions as you are asking now.


Also Google joined in and showed progress on their open source implementation of CUDA. Phoronix is currently the best source for this initiative, and today they shared a story with a link to slides from Google on the project. The results are great: "it is up to 51% faster on internal end-to-end benchmarks, on par with open-source benchmarks, compile time is 8% faster on average and 2.4x faster for pathological compilations compared to NVIDIA's official CUDA compiler (NVCC)".

For compiling CUDA in LLVM you need three parts:

  • a pre-processor that works around the non-standard <<<…>>> notation and splits off the kernels.
  • a source-to-source compiler for the kernels.
  • a bridge between the CUDA API and another API, like OpenCL.

Google has done most of this and now focuses mostly on performance. The OpenCL community can use this project to make a complete CUDA-to-SPIR-V compiler, and use the rest to improve POCL.

Khronos gets a more open homepage

Starting today you can help keep the Khronos webpage up-to-date. Just put in a pull request at https://github.com/KhronosGroup/Khronosdotorg and wait until it gets accepted. This should help the pages stay more current, as you can now improve the webpages in more ways.

More news?

AMD released HCC, a C++ language with OpenMP built in that doesn't compile to SPIR-V.

There have been tutorials and talks on OpenCL, which I should have shared with you earlier.

Tomorrow another post with more news. If I forgot something on Sunday or Monday, I’ll add it here.

OpenCL mini buying guide for X86

Developing with OpenCL is fun, if you like debugging. Having software with support for OpenCL is even more fun, because no debugging is needed. But what would be a good machine? Below is an overview of what kind of hardware you have to think about; it is not in-depth, but gives you enough information to make a decision in your local or online computer store.

Companies that want to build a cluster can contact us for information. Professional clusters need different hardware than described here.

Continue reading “OpenCL mini buying guide for X86”

USB-stick sized ARM-computers

Now that smartphones get more powerful and the internet makes it possible to have all functionality and documents with you anywhere, the computer needs to be reinvented. You see all big IT companies searching for what that reinvention could be, from Windows Metro to complete docking stations that replace the desktop with your phone. A turbulent market.

One of the new product categories is USB-stick-sized computers. Stick them into a TV or monitor, zap in your code and you have your personal working environment. You never need to carry laptops to your hotel room or conference, as long as a screen is available – any screen.

There are several USB-computers entering the market, but I want to introduce you to two. Both of them see a future in a strong processor in a portable device, and neither has a real product with these strong processors yet. But you can expect that in 2013 you can have a device at your key-ring that does very fast parallel processing, for a smooth Photoshop experience.

Continue reading “USB-stick sized ARM-computers”

Our Job-Application Process – with tips and tricks

Job applications can be stressful. You need to invest time in a company that you don't really know and that might reject you, or the job turns out not to be as you envisioned it. Also, application processes differ per company, depending on values and experience – it can be a maze, and difficult to understand what is expected.

We're a relatively small company, but we put more time into our application process than our peers. We used the feedback we got from past applicants to improve it bit by bit. Understand that we want you to successfully walk through the application, and not drown in the process – so if something is not clear on this page, email us via jobs@streamhpc.com. You would be surprised how many others ask questions before starting – statistically, it actually increases the chance of being hired.

We designed the process such that the chance you'll get an offer goes from a few percent in the first step to 90% really quickly. This allows us to spend more time on the people we think stand a good chance.

We want you to succeed!

Seriously, the application process should make sure that people with the right skills are guaranteed to pass.

We wrote this tutorial to help you get through the first rounds successfully. This means that by reading this page, you already increase your chances. As a bonus, various tips & tricks are generally applicable, and you can also use them in other job applications.

Round 1: CV scanning

How to improve your cv

A CV shows an overview of what you can do, with experience as proof. So it does not need to state "he is exceptional in…" but just show what you managed to do. If you did a project with a team, mention your role. If you think an unsuccessful project does not add value, you're wrong – just clearly state your learning points.

We do a quick scan of your CV and letter, which means that we look for keywords like CUDA, OpenCL, SYCL, GLSL, HLSL, Assembly, etc. Second, we try to assess your seniority in CPU programming, GPU programming and overall project experience. For example: if you never worked in a team, mostly did C/C++ programming for 15 years and made your first GPU software some months ago, we'd assess you as a solo worker, CPU-senior and GPU-beginner.

Pro-tip: Embrace the idea that companies simply do CV-scanning. But don’t overdo it by summing up every keyword that could ever apply, as you will get questions you cannot answer.

These labels are not good or bad, but just how we think things are. So make sure we can abstract them from your CV and can connect the dots. E.g. if you mention "OpenCL" under skills, but not under any of the experiences, then we might still reject you. It might therefore be better to mention it under education as "best subjects" instead of under skills – just discuss it in your email.

In case you need a sample CV, just use the below one. For the job-descriptions, be specific with what you did. E.g. “Increased x number by y amount”.

Share code

To further show you have experience, recent GPU code is very helpful for getting through the first filters. Also label it correctly as "university assignment", "book assignment", "hobby project", etc. This helps us assess your code the right way.

Pro-tip: Clean up your code and add comments. This shows how you would work in a professional environment.

For those who send GPU code, we check coding style, efficiency, applied optimizations, etc. We also check whether you used libraries or wrote your own kernels. As the job includes writing those GPU libraries, we're not looking for the ones who only use them.

Pro-tip: Split up work on GPU-kernels and usage of libraries. This shows you’re capable of writing GPU-kernels.

Write a motivational email

Last, but not least, always add a motivational letter. Instead of “See my CV attached”, share why you like working with GPUs and HPC. And preferably also what speaks to you about our company. This explains what drives you, and we can quickly find out if we’re a match.

We see templated emails (sometimes with funny mistakes), but it is not really needed to make it that personal. We understand it is time-consuming to do job applications, so a general text suffices. Think of sentences like:

  • What you seek/need: “Things I value in a job are: ….. I hope I can find them in this job”
  • What you value: “I like working with GPUs, since I did …”
  • What you miss: “I remember a university project ….. I want more of that”

Round 2: short coding test

For those who are left, you do a simple online test in C++ (or C, if you choose that). This test is to get a grasp of your way of working and thinking, and to prepare you for the longer test. We found that puzzle-solvers are good at the work we do. It does help to get some practice in creative coding tasks, if you currently have a boring job.

It takes 10 – 25 minutes, depending on your experience with such tests. We give 30 minutes, which should be enough time for most. If you never did such a test, check the tips under round 4A.

Pro-tip: Do the sample test first, if you’re new to such tests. This will allow you to experiment freely.

If you fail the test, you get an email with hints and the choice to do the long test anyway. This allows those who realized they needed to prepare better to actually do much better in the next round. The hints are written down under round 4A, so you can already prepare.

Reasons people stop here

While we give people the choice to continue the process, not everybody takes that opportunity. Here's some background on their decisions.

  1. Finding out it is too difficult. By now there is a good understanding and context of what the job is actually about. We tried all kinds of ways to communicate that the job is difficult, to serve the bored people, but we've learned that the texts are understood relative to somebody's own experience. We believe in growth (watch "Not there yet" for info on the learning mindset), and luckily some people have applied again a few years later.
  2. Thinking the job is too difficult, and thus not even doing round 1. We found that many people don’t even apply, because they think they cannot do it, since their friends do rocket-science – that’s a missed chance. Get in contact to discuss your doubts.
  3. Not coping with the time-pressure. See what is written under round 4 – we are here to help you pass (if your skills are there)! Get in contact to discuss this, so we can see how we can find solutions.

If you are worried about any of the above, please reach out to us. It is worth doing the short coding test, and we may offer an alternative to the longer one if you have concerns. We want you to have the best chance possible.

There are more reasons, not mentioned here. We try to get 100% of the people with the right skills through, so feedback is always welcome.

Round 3: video call

First real contact! Here we double-check everything we have assumed, and also answer all your questions. Make sure you have questions prepared. If all your questions have already been answered by then, bring them along anyway and mention that they have been answered.

Know that it's a fairly relaxed call, where we just want to get to know you. So we don't want to work through a CV or hear about technical projects at this step. To succeed here, just answer the questions openly and don't try to give the response you think we want to hear.

Pro-tip: With questions prepared, you signal you’re truly interested in the position.

Here we also discuss your salary expectations. We don't pay salaries like the financial or energy sector does, and we need to have clarity on this.

Round 4: long coding test

After the call you are invited for a long coding test. We have two variations: the online coding test (4A) and the homework test (4B).

If you're not that good at coding efficient C++ and solving puzzles, be sure to get more experience in C/C++/GPGPU! It would make sense to wait with your application, or pause it for a few months, and take this seriously. Joining an open source project in C/C++/GPGPU helps a lot with getting through this round, if you seek a method to improve.

Round 4 A: online test

Here you show your skills in C++ and algorithms, not in GPGPU. On average, this takes 2 – 3 hours. There is a warm-up assignment and then 3 bigger assignments. Thus the time per assignment is 45-60 minutes. Understand that we simply test your C++ and puzzle/reading skills, as we need these skills for the projects we run.

If you did the short test really well (80% or higher), chances are increased that you pass this one too. People who did not do well on the first test but studied their mistakes also do well. In all cases follow the tips under "How to prepare" again, as the pass rate for seriously prepared people is always higher.

Statistics: of all the applicants we send an invite for the test, 67% actually do the test and 25% get a score of at least 80. So 37% of the people who start the test pass with an 80+ score. We of course try to make these numbers better – like the 63% who did the test for "nothing"; that share actually used to be well over 90%, but we think we can do better still. One part is helping applicants prepare better for the test.

How to prepare for the Codility test

The below tips work for any coding test. If you want a more serious coding job, you should expect tests, and therefore preparation is important. Codility wrote a nice article on how to prepare for the test, including links to sample tests. Start there. Make sure you practice puzzles like you did at university, especially if you are trying to escape a boring job.

The main challenges are working under time-pressure and not being able to get hints, which take some practice to get used to. Of course we would also like to skip these tests, and we seriously tried alternatives – we simply found that it is important to know how somebody can solve problems on their own.

General tips during the test:

  1. Read all questions carefully. A typical mistake is not understanding the question – it is therefore much better to spend 10 full minutes on understanding the question (and the provided code) than to start asap.
  2. Plan your time. Estimate how much time each assignment will take you. Focus on the ones you have most confidence in, but use a stop-watch to restrict time to 30-35 minutes. This leaves 10 minutes at the end to double check and test your solution.
  3. Make sure you try each assignment: (100+100+0)/3 = 66%, so skipping one caps your maximum score.
  4. Test. Start by designing a set of tests. With just a few exceptions, all applicants who got high scores tested their solution in Codility.
  5. Comment your code. Two reasons: one, if your code fails, comments and tests can give you the benefit of the doubt; two, explaining the code out loud supports problem understanding.
  6. Keep your algorithm books close. We're not testing your memory, but your skills. It is strictly not allowed to copy code, but it is allowed to double-check an algorithm.

Pro-tip: Those who make time for the test within 2 weeks have a higher chance to get to the last rounds.

Round 4 B: homework test

If you do this round for the second time, or if you are not willing to work under time-pressure, we have a homework test. This takes 10 to 20 hours, but can be done in one's own time. Here you show your skills in C++, algorithms and GPGPU. It shows what somebody's level really is in the broad sense, which should give you the necessary feedback on progressing on this career path.

Rounds 5+: The rest

From now on, the chances of getting the job are 90-95%! Assuming you did not cheat – but luckily hardly anybody does. The 5-10% covers unique situations we assumed we did not need to test for.

In these rounds we only double-check things, and focus on getting to know you. Notice that these take more than half of the total time that needs to be invested. We do:

  • the technical interview on C, C++ and GPGPU (2 hours)
  • the long interview (3 hours)
  • reference-checks

We try to plan all in one week, which makes it intense but fast.

Pro-tip: Read an applicable book before the technical interview. This helps freshen up your theoretical knowledge.

A last remark: the door versus the room

There are two types of people who apply: door-people and room-people. Door-people want to get through the door and will then prove they are worth it by working really hard. Room-people focus on the room behind that door, and try to find out how compatible we are with each other. Statistically, we almost only hire room-people. This means that if you focus on checking your compatibility with the job, and ask us questions about how it is once you are in, your chances increase a lot.

If you have questions not discussed here – just email us at jobs@streamhpc.com

We have provided various texts to help you get the information we think is useful:

OpenCL SPIR by example

OpenCL SPIR (Standard Portable Intermediate Representation) is an intermediate representation for OpenCL code, comparable to LLVM IR and HSAIL. It is a search for what would be a good representation, such that parallel software runs well on all kinds of accelerators. LLVM IR is too general, but SPIR is a subset of it. I'll discuss HSAIL, and where it differs from SPIR, later – I thought SPIR was a better way to start introducing these. In my next article I'd like to give you an overview of the whole ecosphere around OpenCL (including SPIR and HSAIL), to give you an understanding of what it all means, where we're going, and why.

Know that the new SPIR-V is something completely different in implementation, and we are only discussing the old SPIR here.

Contributors to the SPIR specification are: Intel, AMD, Altera, ARM, Apple, Broadcom, Codeplay, Nvidia, Qualcomm and Xilinx. Boaz Ouriel of Intel is the pen-holder of the specification, and unsurprisingly Intel had the first SPIR compiler. I am happy to see Nvidia on the committee too, and hope they don't just mine this collaboration for CUDA ideas but finally join in. Broadcom and Xilinx are new, so we can expect contributions from them.

 

For now, let's just look at what SPIR is, as it can help us understand how the compiler works – and write better OpenCL code. I used Intel's offline OpenCL compiler; compiling the kernel below to SPIR can be done on the command line with: ioc64 -cmd=build -input=sum.cl -llvm-spir32=sum.ll (you need an Intel CPU to use the compiler).

__kernel void sum(const int size, __global float * vec1, __global float * vec2){
  int ii = get_global_id(0);

  if(ii < size) vec2[ii] += vec1[ii];
}

There are two variants for generating SPIR code: binary SPIR and LLVM-SPIR (both in 32-bit and 64-bit versions). As you might expect, the binary form is not really readable, but SPIR described in the LLVM IR language luckily is. Run ioc64 without parameters to see more options (Assembly, pure LLVM, Intermediate Binary).

Scientific Visualisation of Molecules

In many hard sciences the focus is on formulas and text, whereas images are mainly graphs or simplified representations of the matter under study. Beautiful visualisations are mostly artist's impressions in popular media, targeting hobby scientists. When Cyrille Favreau made the first well-working version of his real-time GPU-accelerated raytracer, he saw potential in exactly this area: beautiful, realistic visualisations to be used in serious science. This resulted in software called IPV.

He chose to focus on rendering molecules of proteins. This article discusses raytracing in the molecular sciences, while highlighting the features of the software.

This project has been discussed on GPU Science, but this article looks at the software from a slightly different perspective. If you don't want to know how the software works and what it can do, scroll down for a download link.

Continue reading “Scientific Visualisation of Molecules”

StreamComputing exists 5 years!

In January 2010 I took the first steps towards StreamComputing (redacted: rebranded to StreamHPC in 2017) by registering the website and writing a hello-world article. About 4 months of preparations and paperwork later, the freelance company was registered. Five years later it has turned into a small company, still with a strong focus on OpenCL, but with more employees and more customers.

I would like to thank the following people:

  • My parents and grand-mother for (financially) supporting me, even though they did not always understand why I was taking all those risks.
  • My friends, for understanding I needed to work in the weekends and evenings.
  • My good friend Laura for supporting me during the hard times of 2011 and 2012.
  • My girlfriend Elena for always being there for me.
  • My colleagues and OpenCL-experts Anca, Teemu and Oscar, who have done the real work the past year.
  • My customers for believing in OpenCL and trusting StreamComputing.

Without them, the company would never even have existed. Thank you! Continue reading “StreamComputing exists 5 years!”

Differences from OpenCL 1.1 to 1.2

This article will be of interest if you don’t want to read the whole new specification [PDF] for OpenCL 1.2.

As always, feedback will be much appreciated.

After many meetings of the many members of the OpenCL task force, a lot of ideas sprouted, and every 17 or 18 months a new version of OpenCL comes out to give form to them. You can see totally new ideas coming up, some already brought to market in another product by a member. You can also see ideas not appearing at all, because other members voted against them. That last category is very interesting, and hopefully we’ll soon see a lot of forum discussion about what should be in the next version, as it is missing now.

With the release of 1.2 it was also announced that (at least) two task forces will be set up. One of them will target integration into high-level programming languages, which tells me that phase 1 of creating the standard is complete and we can expect to move towards OpenCL 2.0. I will discuss these phases in a follow-up, along with what you as a user, programmer or customer can expect… and how you can act on it.

Another big announcement was that Altera is starting to support OpenCL on an FPGA product. In another article I will tell you everything there is to know. For now, let’s concentrate on the actual software-side differences in this version, and what you can do with them. I have added links to the 1.1 and 1.2 man pages, so you can look things up.

Continue reading “Differences from OpenCL 1.1 to 1.2”

N-Queens project from over 10 years ago

Why should you just delve into porting difficult puzzles to the GPU, to learn GPGPU languages like CUDA, HIP, SYCL, Metal or OpenCL? And if you have not picked a puzzle yet, why not N-Queens? N-Queens is a truly fun puzzle to work on, and I am looking forward to learning about better approaches via the comments.

We love it when junior applicants have a personal project to show, even if it’s unfinished. As it can be scary to share such an unfinished project, I’ll go first.

Introduction in 2023

Everybody who starts in GPGPU has this moment where they feel great about the progress and speedup, but then suddenly get totally overwhelmed by the endless paths to further optimization. And of course 90% of the potential optimizations don’t work well – it takes many years of experience (and mentors in the team) to excel at it. This was also a main reason why I like GPGPU so much: it remains difficult for a long time, and it never bores. My personal project where I had this overwhelmed+underwhelmed feeling was N-Queens – until then I could solve the problems in front of me.

I worked on this backtracking problem as a personal fun project in the early days of the company (2011?), and decided to blog about it in 2016. But before publishing I thought the story was not ready to be shared, as I had changed the way I coded, learned so many more optimization techniques, and (like many programmers) thought the code needed a full rewrite. Meanwhile I had to focus much more on building the company, and my colleagues also got better at GPGPU coding than me – this didn’t change in the years after, and I’m now the dumbest coder in the room.

Today I decided to just share what I wrote down in 2011 and 2016, and for now focus on fixing the text and links. As the code was written in Aparapi and not pure OpenCL, it would take some good effort to make it available – I decided not to do that, to prevent postponing this post even further. Luckily somebody on this same planet had about the same approaches as I had (plus more), and actually finished the implementation – scroll down to the end if you don’t care about the approaches and just want the code.

Note that when I worked on the problem, I used an AMD Radeon GPU and OpenCL. Tooling from AMD was hardly there at the time, so you might find a remark that has not aged well.

Introduction in 2016

What do 1, 0, 0, 2, 10, 4, 40, 92, 352, 724, 2680, 14200, 73712, 365596, 2279184, 14772512, 95815104, 666090624, 4968057848, 39029188884, 314666222712, 2691008701644, 24233937684440, 227514171973736, 2207893435808352 and 22317699616364044 have to do with each other? They are the first 26 solutions of the N-Queens problem. Even if you are not interested in GPGPU (OpenCL, CUDA), this article should give you some background of this interesting puzzle.

An existing N-Queens implementation in OpenCL took 2.89 seconds for N=17 on my GPU, while Nvidia hardware took half that. I knew it did not use the full potential of the GPU, because bitcoin mining dropped to 55% and not to 0%. 🙂 I only had to find those optimizations by redoing the port from another angle.

This article was written as a journal while I programmed, so you see which questions I asked myself on the way to the solution. I hope this also gives some insight into how I work, and into the hard part of the job: most of the energy goes into preparations that yield no results.

Continue reading “N-Queens project from over 10 years ago”

OpenCL.org

Last year we bought OpenCL.org with the purpose of supporting the OpenCL community and OpenCL-focused companies. In January we launched the first community project on the website: porting GEGL to OpenCL. See below for more info.

The knowledge section of our homepage will be moved to the OpenCL.org website, but will still be maintained by us.

GEGL project

GEGL is a free/libre graph based image processing framework used by GIMP, GNOME Photos, and other free software projects.

In January 2016 we launched an educational initiative that aims to get more developers to study and use OpenCL in their projects. Within this project, up to 20 collaborators will port as many GEGL operations to OpenCL as possible.

The goal of this project is to find a way for a group to educate themselves in OpenCL while supporting an open-source project. One way is to gamify the porting by benchmarking the kernels and declaring winners; another is to optimize kernels within StreamHPC to push the limits. Victor Oliveira, who wrote most of the OpenCL code in GEGL, joined the GEGL-OpenCL project as an advisor.

All work is being done on GitHub. The communication between participants is taking place in a dedicated Slack channel (invite-only).

Want to have a vote on what is the next porting project after GEGL? Vote here.

Caffe and Torch7 ported to AMD GPUs, MXnet WIP

Last week AMD released ports of Caffe, Torch and (work-in-progress) MXnet, so these frameworks now work on AMD GPUs. With the Radeon MI6, MI8 and MI25 (25 TFLOPS half precision) to be released soon, it is of course simply necessary to have software running on these high-end GPUs.

The ports were announced in December. You can see the MI25 is about 1.45x faster than the Titan XP. With the release of these three frameworks, current GPUs can now be benchmarked and compared.

The expected good performance/price ratio will make this very interesting, especially for large installations. Another slide discussed which frameworks will be ported: Caffe, TensorFlow, Torch7, MxNet, CNTK, Chainer and Theano.

This leaves HIP ports of TensorFlow, CNTK, Chainer and Theano still to be released. Continue reading “Caffe and Torch7 ported to AMD GPUs, MXnet WIP”

Using Qt Creator for OpenCL

More and more ways are becoming available to bring easy OpenCL to you. Most of the convenience libraries are wrappers for other languages, so it seems that C and C++ programmers have the hardest time. For a while now my favourite way to go has been Qt: it is multi-platform, has a good IDE, is very extensive, has good multi-core and OpenGL support and… has an extension for OpenCL: http://labs.trolltech.com/blogs/2010/04/07/using-opencl-with-qt http://blog.qt.digia.com/blog/2010/04/07/using-opencl-with-qt/

Other multi-platform choices are Anjuta, CodeLite, Netbeans and Eclipse. I will discuss them later, but wanted to give Qt a head start because it also simplifies your OpenCL development. While it is great for learning OpenCL concepts, be aware that the commercial version of Qt Creator costs at least €2995,- a year. I must also warn you that the plugin is still in beta.

streamhpc.com is not affiliated with Qt.

Getting it all

Qt Creator is available in most Linux repositories: install the packages ‘qtcreator’ and ‘qt4-qmake’. For Windows, Mac and other Linux distributions there are installers available: http://qt.nokia.com/downloads. People who are not familiar with Qt really should take a look around on http://qt.nokia.com/.

You can get the source of the QtOpenCL plugin by using Git:

git clone http://git.gitorious.org/qt-labs/opencl.git QtOpenCL

See http://qt.gitorious.org/qt-labs/opencl for more information about the status of the project.

You can download it here: https://dl.dropbox.com/u/1118267/QtOpenCL_20110117.zip (version 17 January 2011)

Building the plugin

For Linux and Mac you need the ‘build-essentials’. For Windows it might be a lot harder, since you need make, gcc and a lot of other build tools which are not easily packaged for Windows. If you’ve made a win32 binary and/or a Windows-specific how-to, let me know.

You might have seen that people have problems building the plugin. The trick is to use the options -qmake and -I (capital i) with the configure-script:

./configure -qmake <location of qmake 4.6 or higher> -I<location of directory CL with OpenCL-headers>

make

Notice the spaces. The program qmake is provided by Qt (package ‘qt4-qmake’), the OpenCL headers by the SDK of ATI or NVIDIA (you’ll need the SDK anyway), or by Khronos. For example, on my laptop (NVIDIA, Ubuntu 32-bit, with Qt 4.7):

./configure -qmake /usr/bin/qmake-qt4 -I/opt/NVIDIA_GPU_Computing_SDK_3.2/OpenCL/common/inc/

make

This should work. On Mac the directory is not CL but OpenCL – I haven’t tested whether Qt takes that into account.

After building, test it by setting the environment variable LD_LIBRARY_PATH to the lib directory of the plugin, and run the provided example app ‘clinfo’. For example, on Linux:

export LD_LIBRARY_PATH=`pwd`/lib:$LD_LIBRARY_PATH

cd util/clinfo/

./clinfo

This should give you information about your OpenCL-setup. If you need further help, please go to the Qt forums.

Configuring Qt Creator

Now it’s time to make a new project with support for OpenCL. This has to be done in two steps.

First make a project and edit the .pro-file by adding the following (note the backslashes continuing the INCLUDEPATH assignment over multiple lines):

LIBS += -L<location of opencl-plugin>/lib -L<location of OpenCL-SDK libraries> -lOpenCL -lQtOpenCL

INCLUDEPATH += <location of opencl-plugin>/lib/ \
    <location of OpenCL-SDK include-files> \
    <location of opencl-plugin>/src/opencl/

For example:

LIBS += -L/opt/qt-opencl/lib -L/usr/local/cuda/lib -lOpenCL -lQtOpenCL

INCLUDEPATH += /opt/qt-opencl/lib/ \
    /usr/local/cuda/include/ \
    /opt/qt-opencl/src/opencl/

The following screenshot shows how it could look:

Second, we edit (or add) LD_LIBRARY_PATH in the project settings (click on ‘Projects’ as seen in the screenshot):

/usr/lib/qtcreator:<location of opencl-plugin>:<location of OpenCL-SDK libraries>:

For example:

/usr/lib/qtcreator:/opt/qt-opencl/lib:/usr/local/cuda/lib:

As you can see, we now also need to include the Qt Creator libraries and the SDK libraries.

The following screenshot shows the edit-field for the project-environment:

Testing your setup

Just add something from the clinfo-source to your project:

printf("OpenCL Platforms:\n");
QList<QCLPlatform> platforms = QCLPlatform::platforms();
foreach (QCLPlatform platform, platforms) {
    printf("    Platform ID       : %ld\n", long(platform.platformId()));
    printf("    Profile           : %s\n", platform.profile().toLatin1().constData());
    printf("    Version           : %s\n", platform.version().toLatin1().constData());
    printf("    Name              : %s\n", platform.name().toLatin1().constData());
    printf("    Vendor            : %s\n", platform.vendor().toLatin1().constData());
    printf("    Extension Suffix  : %s\n", platform.extensionSuffix().toLatin1().constData());
    printf("    Extensions        :\n");
    QStringList extns = platform.extensions();
    foreach (QString ext, extns)
        printf("        %s\n", ext.toLatin1().constData());
    printf("\n");
}

If you get errors while programming (underlined includes, etc.), focus on INCLUDEPATH in the project file. If it complains when building the application, focus on LIBS. If it complains when running the successfully built application, focus on LD_LIBRARY_PATH.

OK, getting it running is maybe not that easy, but I promise it gets easier after this. Check out our Hello World, the provided examples and http://doc.qt.nokia.com/opencl-snapshot/ to start building.

NVIDIA’s answer to FirePro S9000: the TESLA K20

Two months ago I wrote about the FirePro S9000 – AMD’s answer to the K10 – and was already looking forward to this K20. Whereas in the gaming world it hardly matters which card you buy to get good gaming performance, in the compute world it does. AMD presented 3.23 TFLOPS with 6GB of memory, and now we are going to see what the K20 can do.

The K20 is very different from its predecessor, the K10. The biggest differences are its much higher double-precision capability and the fact that it achieves its performance with a single GPU. To keep power usage low, it is clocked at only 705MHz, compared to the 1GHz of the K10. I could not find information on the memory bandwidth.

ECC is turned on by default, hence you see it presented as having 5GB. There is no information yet on whether this card also has ECC on the local memories/caches.

Continue reading “NVIDIA’s answer to FirePro S9000: the TESLA K20”

Kernels and the GPL. Are we safe and linking?

Disclaimer: I am not a lawyer and below is my humble opinion only. The post is for insights only, not for legal matters.

The GPL was always a protection against somebody or some company running away with your code and making money with it – or at least a way to force improvements back into the community. For unprepared companies this caused quite some stress when they were forced to give their software away. Now we have host-plus-kernel languages such as OpenCL, CUDA, DirectCompute and RenderScript, which don’t really link a kernel, but load and launch it. As the GPL is quite complicated when mixed with commercial code, I want to warn you that the GPL might not be prepared for this.

If your software is dual-licensed, you cannot assume the GPL is not the licence that applies when your code is eventually used in commercial software. Read below why not.

I hope we can have a discussion here, so we get to the bottom of this.

Continue reading “Kernels and the GPL. Are we safe and linking?”