How to introduce HPC in your enterprise

Spare time in IT – ©

The past ten years we have been happy when we got back home from the office. Our home-computer is simply faster, has more software, more memory and does not take over 10 minutes to boot. Office-computers can be that slow, because 90% of the work is typing documents anyway. Meanwhile the office-servers are mostly used for the intranet and backups only. It’s the way of life and it seems we have to accept it.

But what if you have a daily batch that takes 1 hour to run and 10 people need to wait for the results to continue their tasks? What if you simply need a bigger server to service your colleagues faster? Then Office-HPC can be the answer, the type of High Performance Computing that is affordable and in reach for most companies with more than 50 employees.

Below you’ll find out what you should do, in a nutshell.

Phase 0: Get familiar with parallel and GPU-computing, and convince your boss

This will take one or two weeks only, as it’s more about understanding the basics.

Understand where it’s all about and what’s important. We offer trainings, but you can also look around in the “knowledge base” in the menu above for lots of free advice. It’s very important and should be done before anything else. Even though you end up with CUDA, learn the basics of OpenCL first. Why? Because after CUDA there is only one answer: using Nvidia hardware. Please delay this decision to later, before you end up with the wrong solution.

How to get your boss to invest in all this? I won’t lie about it: it’s a big investment. Luckily the return-on-investment is very good, even when only 10 people are using the software in the company. If the waiting period per person per day is reduced with 20 minutes per day, then it’s easy to see that it pays back quickly: that’s 80 hours per person per year. Based on 10 people that is already €20K per year. StreamHPC has sped up software to take hours less time to process the daily data – therefore many of our clients could earn back the investment within a year, easily.

Phase 1: Know what device you want to use

Quite often I get customers who have bought an expensive Tesla, FirePro or XeonPhi and then ask me to speed up their software. Often I get questions “how do I speed up this algorithm on this device?”, while the question should be like “How do I speed up this algorithm?”. It takes some time to find out what device fits the algorithm best.

There is too much to discuss in this phase, so I keep it to a short Q&A. Please ask us for advice, as this phase is very important! We prefer to help people for free, than to read about failed “HPC in the office” projects (and giving others the idea that the technology is not ready yet).

Q: What programming language do I use?

Let’s start with the short answer. Is everything to be used within your office only, for ever? Use any language you want: CUDA, OpenCL or one of the many others. If you want the software to run on more devices, use OpenCL or OpenGL shaders. For example when developing with several partners, you cannot stick to CUDA and should use OpenCL – else you force others to make certain investments. But if you have some domain specific compute-engine where you will only share the API in the cloud, you can use CUDA without problems.

Part of the long answer is that it is entangled with the algorithm you want to use. Please take good care of this, and make your decision based on good research – not based on what people have told you without discussing your code first.

Q: FPGAs? Why would I use those?

True, they’re more expensive, but they use much less power (20-30 Watt TDP). They’re famous for low-latency computations. If you already have OpenCL-software, it ports quite easily to the FPGA – therefore I like the combination with AMD FirePro (good OpenCL support) and Altera Stratix V.

Xilin recently also started to support OpenCL on their devices. They have the same reason as Altera: to make development time for FPGA code shorter.

Q: Why do CPUs still exist?

Because they perform pretty well on very irregular algorithms. The latest Xeon CPUs with 16 cores outperform GPUs when code-branch prediction is used heavily. And by using OpenCL you can get more performance than when using OpenMP, plus you can port between devices much easier.

Q: I heard I should not use gaming GPUs. Why not?

A: Professional accelerators come with support and tuned libraries, which explains part of the higher price. So even if gaming-GPUs suffice, you need the support before you get to a cluster – the free support is mostly community-based and only gives answers to the problems everybody has. Also libraries are often better tuned for professional cards. See it as this: gaming-GPUs come with free games, professional compute-GPUs come with free support and libraries.

Q: I can’t have passively cooled server-GPUs in my desktop. What now?

  • Intel: Go for the XeonPhi’s which end with an “A” (= active cooled)
  • NVIDIA: For the newly announced K80, there will not be an active cooled version – so take the active cooled K40.
  • AMD: For the S9150 get a W9100.
  • Altera: Low-power, so you can use the same device. Do ask your supplier specifically if it applies to the FPGA you have in mind.

Phase 2: Have your office computer upgraded

As the goal is to see performance in a cluster, then it’s better to have at least two accelerators in your computer. This is a big investment, but it’s also a good investment. It’s the first step towards getting HPC in your office, and better do it well. Make sure you have at least the memory for your CPU as you have on your accelerator, if you want to use all the GPU’s memory. The S9150 has 16GB of memory, so you need 32GB MB to support two cards.

If you make use of an external software development company, you also need to have a good machine to test out the software and to understand the code that will be rolled out in your company. Control and understanding of the code is very important when working with consultants!

In case you did not get through phase 1 completely, better to test with one Accelerator first. If you don’t need to have something like OpenGL/OpenCL-interaction, make sure you use a third GPU for the video-output, as usage can influence the GPU performance.

Program your software using MPI for connecting the two accelerators and be in full control of what is blocking, to be prepared for the cluster.

Phase 3: Roll software out in a small group

At this phase it’s time to offer the service to a selected group. Say that you have chosen to offer your compute solution via an Excel-plugin, which communicates with the software via an API. Add new users one at a time – make sure (parts of) the  results are tested! From here it’s software-development as we know it, and the most unexpected bugs come out of the test-group.

If you get good results, your colleagues will have some accelerators by now too. If you did phases 0 and 1 well, you probably will get good results anyway. The moment you have setup the MPI-environment on multiple desktops, you have just setup your minimal test-street. Very important for later, as many enterprises lack a test-street – then it’s better to have it partially shared with your development-environment. I’m pretty sure I get comments on this, but I would really like to have more companies to do larger scale tests before the production step.

Phase 4: Get a cluster (or cloud service)

P_setting_fff_1_90_end_500.pngIf your algorithm is not CPU-bound, then it’s best to have as many GPUs per CPU as possible. Else you need to keep it to one or two. We can give you advice on this in phase 1 already, so you know where to prepare for. Then the most important step comes: calculate how much hardware you need to support the needs of your enterprise. It is possible that you only need one node of 8 GPUs to support even thousands of users.

Say the algorithm is not CPU-bound, then it’s best to put as many GPUs per node. Personally I like ASUS servers most, as they are very open to all accelerators, unlike others who only offer accelerators from “selected partners”. At SC14 they introduced the ESC8000 E3, which holds 8 accelerators via PCIe3 x16 buses. There are more options available, but they only offer systems that don’t mention support for all vendors – my experience is that you get worse support if you do something special.

For Altera-only nodes, you should check for complete different server cases, as cooling requirements are different. For Xeon-only nodes, you can find solutions with 4 CPU-sockets.

If you are allowed to transport company-data outside the local network and can handle the data-transports over the internet, then a cloud-based service might also be a choice. Feel free to ask us what the options are nowadays.

You’re done

If the users are happy, then probably more software needs to be ported to the accelerators now. So good luck and have fun!