At the end of October I gave a talk for the Thalesians, a group that organises different kinds of talks for people working in or interested in the financial markets. If you live in London, I would certainly recommend you visit one of their talks. But from a personal perspective I had a difficult task: how to make a very diverse audience happy? The talks I gave in the past were for a more homogeneous and familiar audience, and now I did not know at all how much OpenCL programming experience the attendees had. I chose to give an overview and reserve time for questions.
After starting with some honest remarks about my understanding of the British accent, and that being honest with them would kill my business, I spoke about 5 subjects. Some of them you might have read about here, but not all. You can download the slides [PDF] via this link: Vincent.Hindriksen.20101027-Thalesians. The text below is meant to make the slides clearer, but it is certainly not the complete talk. So if you have the feeling I skipped a lot of text, your feeling is right.
We can download the SDK and run example programs, but that does not help the big piece of software that needs to run faster. Even if it needed exactly the code provided for free in the SDK, we would still have a basic problem. Since I thought banks and other financial institutions would not always have the convenience of starting from scratch, I started my talk with the software engineering needed to add OpenCL to existing projects. Or: when is software ready to have OpenCL added?
A problem in IT is that the world does not adapt to fixed targets but changes continuously. Since all folks at “the business” understand Excel, there can be some funny and less funny situations where the software evolves into a big chunk of functionality. I called it a platform, since an often-asked question is “can my algorithm run on it?”. I stated that software that has been labelled as a platform cannot have OpenCL algorithms integrated easily. Now you might think this platform-transformation does not apply to your company’s software, but what will you tell your internal client who wants to “test” a variation on the OpenCL algorithm you just introduced? Change in IT is like the frog in slowly heated water: it goes unnoticed until it is too late.
After explaining why my main job before StreamHPC was fixing slowly changed architectures, I talked about the needs, problems and solutions (pages 14-16). The needs start with “we need speed”, then come the communication problems (in Excel it works differently than in OpenCL), then the change management. These can easily be expressed as problem descriptions, but the solutions need to come in reverse order: first good change management, then professionalising communication between groups and then (at last) adding the alien code. More about this is in my previous article.
How OpenCL works for you
If you want to learn OpenCL, please check our tutorial links. For now I focus on other parts than the basics.
I started with two famous quotes from the massively parallel processing world. The free lunch refers to the increase of clock speed, execution optimisation and cache, and you can read the whole article here. Cray’s quote explains the technique perfectly: the answer to his question is “it depends”. Sometimes you want the oxen and sometimes the chickens, so you do want hybrid processors for sure.
After some basics I listed the serious alternatives to OpenCL (pages 26-28). They all have one thing in common: they are limited in some way. Having said that, I took the chance to show how well OpenCL is backed. Then there is my first reference to OpenCL’s maturity, which shows it will actually take another year to mature. But the need is so high that it is growing faster than that.
Sine calculations are done by approximating the function (Taylor series). You need to know that, because it has consequences. GPUs also have faster alternatives built in, which can be less precise. Actually a lot of hardware differs in architecture even when it does the same thing. As a programmer you have to know and understand these differences to get better results. As an example I took FFT, which NVIDIA’s hardware can do much faster. But if you look around the blogosphere and in discussions, you see that AMD has much faster hardware for many other algorithms. When the hybrid processors arrive, the claims of who is fastest will change again. If you check the measurement differences (pages 39-40, taken from Bealto), please look closely at the change of the y-scale.
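To make the precision trade-off concrete, here is a minimal sketch in plain Python (not OpenCL) of a truncated Taylor series for the sine. The function names `taylor_sin` and `max_error` are made up for this illustration, and real math libraries and GPUs usually use more refined polynomials than a plain Taylor series. It only shows the principle: fewer terms means less work and a less precise answer, which is the kind of trade you make when you pick a fast built-in such as OpenCL's `native_sin` over the regular `sin`.

```python
import math

def taylor_sin(x, terms):
    """Approximate sin(x) with the first `terms` terms of its Taylor series around 0."""
    # Bring the argument into [-pi, pi]; a truncated series is only accurate near 0.
    x = math.remainder(x, 2.0 * math.pi)
    term = x      # first term of the series: x
    total = x
    for n in range(1, terms):
        # Next term = previous term * (-x^2) / ((2n) * (2n + 1))
        term *= -x * x / ((2 * n) * (2 * n + 1))
        total += term
    return total

def max_error(terms, samples=1000):
    """Largest absolute difference from math.sin, sampled over [-pi, pi]."""
    worst = 0.0
    for i in range(samples + 1):
        x = -math.pi + 2.0 * math.pi * i / samples
        worst = max(worst, abs(taylor_sin(x, terms) - math.sin(x)))
    return worst

if __name__ == "__main__":
    # Fewer terms is cheaper, but the worst-case error grows quickly.
    for terms in (3, 5, 8, 12):
        print(terms, "terms -> max error", max_error(terms))
```

Running it prints the worst-case error for each number of terms, so you can see how much precision each extra multiplication buys.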
Anyone familiar with the financial world knows that there are implementations of Monte Carlo, Mersenne Twister and Black-Scholes in the SDKs. I wanted to focus on how to recognise formulas which are fit for implementing in OpenCL (page 46). Actually a lot of formulas are like this, so I hope to have made many people happy. The example is from the NVIDIA SDK (NVIDIA, thank you!). The results are from Fixstars’ OpenCL book. The results were bad, but if you look at page 54, you see there is much faster hardware available now. The difference in performance between a CPU and OpenCL hardware increases every year.
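As an illustration of what such a formula can look like, here is a small sketch in plain Python of the Black-Scholes price of a European call. It is not the NVIDIA SDK code, and the names `norm_cdf`, `black_scholes_call` and `price_all` are my own for this example. I assume here that "fit for OpenCL" means that every result depends only on its own inputs, so many results can be computed side by side.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(spot, strike, rate, volatility, maturity):
    """Price of one European call option (Black-Scholes closed form)."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * volatility ** 2) * maturity) / (volatility * sqrt_t)
    d2 = d1 - volatility * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)

def price_all(options):
    """Price a list of (spot, strike, rate, volatility, maturity) tuples."""
    # No option needs the result of another one, so the order does not matter.
    # In an OpenCL kernel each loop iteration would become one work-item.
    return [black_scholes_call(*option) for option in options]

if __name__ == "__main__":
    portfolio = [
        (100.0, 100.0, 0.05, 0.2, 1.0),
        (100.0, 110.0, 0.05, 0.2, 1.0),
        (100.0, 90.0, 0.05, 0.3, 0.5),
    ]
    for option, price in zip(portfolio, price_all(portfolio)):
        print(option, "->", round(price, 4))
```

The thing to look for is the loop in `price_all`: there is no shared state and no dependency between iterations, which is what makes a formula like this map so directly onto thousands of parallel threads.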
What’s out there
Because CUDA is more evolved and focuses on only one type of hardware, it is easier to use. The OpenCL specification focuses purely on the core and leaves the rest to others. So libraries appeared to aid the programmer. I picked a few which could be of interest to my audience.
Not everybody writes code in C or C++, so OpenCL has been wrapped to make it usable in e.g. Java, Python and Ruby. The first versions were just basic translations, but they all evolved into convenience classes. C wrappers are also available, which do things like initialisation for you. They only have advantages: as you might know from Java, these little programs lead the way to what becomes the next version of the language.
Aparapi is a Java-bytecode-to-OpenCL translator. It is still in beta, but AMD is working hard to make it available soon. They also have a special interest in the financial sector, where Java is used a lot. I mention that explicitly, because quant programmers can expect more help than if Aparapi did not take the financial world into account. The example makes clear it is pure Java and looks a bit like OpenCL. The disadvantages will surely be solved in the future, except that it will be AMD-only.
HMPP is a C/Fortran-to-OpenCL translator. In other words, existing code can be translated. This reduces development time enormously. I will write about this neat program in one of my next articles.
Auto-tuning, like GATLAS
Combining HMPP with auto-tuning software reduces development time even more. It finds the right configuration for the initialisation and the kernel, so the code runs faster on the current hardware. This is an important matter, so many people write scientific articles about it.
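The idea behind auto-tuning can be sketched in a few lines of plain Python. This is not how GATLAS works internally (I make no claim about that); it is just the simplest possible approach, trying every candidate setting and keeping the fastest. The workload `blocked_sum_of_squares` and the function `autotune` are hypothetical stand-ins; in a real OpenCL program the tuned setting would be something like the work-group size of a kernel.

```python
import time

def blocked_sum_of_squares(data, block_size):
    """Stand-in workload: process the data in blocks of `block_size` items."""
    total = 0.0
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        total += sum(v * v for v in block)
    return total

def autotune(workload, data, candidates, repeats=3):
    """Time the workload for every candidate setting and return the fastest one."""
    timings = {}
    for block_size in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            workload(data, block_size)
            # Keep the best of several runs to reduce timing noise.
            best = min(best, time.perf_counter() - start)
        timings[block_size] = best
    fastest = min(timings, key=timings.get)
    return fastest, timings

if __name__ == "__main__":
    data = [float(i) for i in range(200000)]
    fastest, timings = autotune(blocked_sum_of_squares, data, [16, 64, 256, 1024])
    print("fastest block size on this machine:", fastest)
```

The outcome differs per machine, which is exactly the point: the tuning has to be repeated for the hardware the code actually runs on.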
Jacket is the best Matlab plugin you can get when you want speed. If there were one reason to switch to CUDA, it would be this program. Matlab itself is also experimenting with GPGPU, but has come up with only basic features. I know that the people from Jacket are experimenting with OpenCL too, but they are not happy with the results yet. For developers of (financial) algorithms, this support in mathematical software is very important.
I spoke about drivers, wrappers & libraries maturing, not really about integration in programming languages (LINQ in .NET 4). But a lot of companies want to make the promise of “less OpenCL development time” come true, so it is not only the open source projects of hobbyists that are available.
Hybrid processors (Intel Sandy Bridge and AMD Fusion) will make a big difference starting next year, while there is big growth in the market for low-power processors (ARM). The graphs show that in just a few months’ time the NVIDIA drivers matured a lot.
I concluded that OpenCL will be the major technique used in HPC (yes, I dared to state that), and it is only a matter of time. The choice is yours: catch the train now and have the advantage of more time to learn, or catch the full train later. I’m always happy with discussions, so let me hear in the comments what you think.