AMD positions FirePro S10000 against both TESLA K10 (and K20)

Posted by Vincent Hindriksen on 15 November 2012

During the “little” HPC-show, SC12, several vendors launched some very impressive products. The question is: who steals the show from whom? Intel finally got their Phi-processor launched, NVIDIA came with the TESLA K20 plus K20X, and AMD introduced the FirePro S10000.

This card is the fastest card out there with 5.91 TFLOPS of processing power – much faster than the TESLA K20X, which only does 3.95 TFLOPS. But comparing a dual-GPU to a single-GPU card is not always fair. The moment you choose to have more than one GPU (several GPUs in one case or a small cluster), the S10000 can be fully compared to the Tesla K20 and K20X.

The S10000 can be seen as a dual-GPU version of the S9000, but it does not fully add up. Most obvious are the big difference in power-usage (325 Watt) and the active cooling. As server-cases are made for 225 Watt cooling-power, this is seen as a potential disadvantage. But AMD has clearly looked around – for GPUs it is not 1U-cases that are used, but 3U-servers that use the full width to stack several GPUs.

Continue reading “AMD positions FirePro S10000 against both TESLA K10 (and K20)” →

Power to the Vector Processor

Posted by Vincent Hindriksen on 12 August 2011

Reducing energy-consumption is “hot”

After reading the article “Nvidia is losing on the HPC front” by The Inquirer, which mixes up the demand for low-power architectures with the other side of the market, the demand for high performance, I started to think that it is not that clear that there are two markets using the same technology. Nvidia has also proven the claim to be untrue, since the super-computer “Nebulae” uses almost half the watts per flop of the #1. How come? I quote The Register from an article that is one year old:

>>When you do the math, as far as Linpack is concerned, Jaguar takes just under 4 watts to deliver a megaflops at a cost of $114 per megaflops for the iron, while Nebulae consumes 2 watts per megaflops at a cost of $39 per megaflops for the system. And there is little doubt that the CUDA parallel computing environment is only going to get better over time and hence more of the theoretical performance of the GPU ends up doing real work. (Nvidia is not there yet. There is still too much overhead on the CPUs as they get hammered fielding memory requests for GPUs on some workloads.)<<

Nvidia is (and should be) very proud. But actually I’m already looking forward to the time when hybrids get more common. They will really shake up the HPC-market (as The Register agrees) by lowering latency between GPU and CPU and lowering energy-consumption. But where we can find a bigger market is the mobile market.

Continue reading “Power to the Vector Processor” →

AMD’s answer to NVIDIA TESLA K10: the FirePro S9000

Posted by Vincent Hindriksen on 13 September 2012 with 8 Comments

Recently AMD announced their new FirePro GPUs to be used in servers: the S9000 (shown at the right) and the S7000. They use passive cooling, as server-racks are actively cooled already. AMD partners for servers will have products ready Q1 2013 or even before. SuperMicro, Dell and HP will probably be among the first.

What does this mean? We finally get a very good alternative to TESLA: servers with probably 2 (1U) or 4+ (3U) FirePro GPUs giving 6.46 to 12.92 TFLOPS or more of theoretical extra performance on top of the available CPU. At StreamHPC we are happy with that, as AMD is a strong OpenCL-supporter and FirePro GPUs give much more performance than TESLAs. It also outperforms the unreleased Intel Xeon Phi in single precision and is close in double precision.
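The TFLOPS range quoted here is plain multiplication. A minimal sketch, assuming a single-precision peak of 3.23 TFLOPS per card (inferred from 6.46 / 2, not stated in the text) and a hypothetical helper name:

```python
def aggregate_peak_tflops(cards: int, tflops_per_card: float = 3.23) -> float:
    """Theoretical peak of several identical GPUs.

    Assumption: 3.23 TFLOPS per card is derived from the 2-GPU figure.
    Real applications will not scale perfectly across cards.
    """
    return round(cards * tflops_per_card, 2)


if __name__ == "__main__":
    for cards in (2, 4):
        print(cards, "GPUs:", aggregate_peak_tflops(cards), "TFLOPS (theoretical)")
```

These are peak numbers only; what an application reaches depends on how well it scales over several GPUs.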

Edit: About the multi-GPU configuration

A multi-GPU card has various advantages as it uses less power and space, but does not compare to a single GPU. As the communication still goes via the PCI-bus, the compute-capabilities of two GPU cards and of one multi-GPU card are not that different. Compute-problems are most often memory-bound, and that is an important reason why GPUs outperform CPUs, as they have a very high memory bandwidth. Therefore I put a lot of weight on the memory and cache available per GPU and per core.

Continue reading “AMD’s answer to NVIDIA TESLA K10: the FirePro S9000” →

When Big Data needs OpenCL

Posted by Vincent Hindriksen on 25 August 2012

Big Data in the previous century was the archive full of ring-binders/folders/ordners, which would grow each year at the same pace. Now the definition is that it should grow each year as much as all years before combined.

A few months ago SunGard named 10 Big Data trends transforming financial services. I have used their list as a base for my own focus: increased computation-demands, not specific to this one market. This resulted in 7 general trends where Big Data meets/needs OpenCL.

Since the start of StreamHPC we have sought customers who could not compute through their whole data in time. Back then Big Data was still a buzzword catching on, but it best describes this one core business.

Continue reading “When Big Data needs OpenCL” →

Will OpenCL work for me?

OpenCL can accelerate your software by multiple factors, but… only if the data and the software are a good fit.

The same applies to CUDA and other GPGPU-methods.

Find out in 4 steps whether you can speed up your software with OpenCL.
[columns]
[one_half title=”1. Lots of repetitions”]
The main way to find code that can be run in parallel is to look for loops that take a relatively large share of the run time. If an action needs to be done for each part of the data-input, then the code certainly contains a lot of loops. You can go to the next step.

If data goes through the code from A to B in a straight line without many loops, then there is a very low chance that computing-speed is the bottle-neck. A faster network, better caching, faster memory and such should be looked into first.
[/one_half]
[one_half title=”2. No or few dependencies”]
If in loops there are no dependencies on the previous step, then you can go to the next step.

As low interdependencies do not matter for single-core software, this was not an important focus for developers even five years ago. Know that there are now many new algorithms which decrease loop-internal dependencies. If your software has been optimised for several processors or even a cluster, then the step to OpenCL is much smaller.

For example, search-problems can be sped up by dividing the data between many threads. Even though the dependency is high within a thread, the dependency on the other threads is very low.
[/one_half]

[/columns]
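Steps 1 and 2 can be illustrated with a minimal sketch. Python is used here only for readability (it is not OpenCL, and Python threads do not show the real speed-up), and all function names are hypothetical. The first loop has independent iterations, the second carries a dependency from one step to the next, and the third is the search case: dependent within a chunk, independent between chunks.

```python
from concurrent.futures import ThreadPoolExecutor


def scale(x: float) -> float:
    # Uses only its own input element, never a neighbour or a previous result
    return 2.0 * x + 1.0


def independent_loop(data):
    # No iteration depends on another, so the work can be split freely
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(scale, data))


def running_sum(data):
    # Each step needs the previous total: a loop-carried dependency.
    # Written like this it cannot simply be split over many workers.
    out, total = [], 0.0
    for x in data:
        total += x
        out.append(total)
    return out


def chunked_search(data, target, chunks=4):
    # Each chunk is scanned sequentially, but chunks do not depend on each other
    size = max(1, -(-len(data) // chunks))  # ceiling division
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=chunks) as pool:
        return any(pool.map(lambda part: target in part, parts))
```

Loops shaped like `independent_loop` and `chunked_search` are the candidates to take to the next step; a loop like `running_sum` first needs a different algorithm.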

[columns]

[one_half title=”3. High predictability to avoid branching”]

Computations need to be as predictable as possible to get the highest speed-up. That means the code within the loops needs to have no or few branches, so code without statements like if, while or switch. This is because GPUs work better if the whole processor does the same thing. So if you now have many threads which all do different things, then a CPU is still the best solution. As with decreasing dependencies in step two, in many cases redesigning the algorithm can result in well-performing GPU-code.
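As a small illustration of such a redesign, a branch can often be replaced by arithmetic, so that every element follows the same path. A minimal sketch with hypothetical function names, in Python for readability only (Python itself will not show the speed difference a GPU would):

```python
def pick_branchy(x, threshold, low, high):
    # Different inputs take different paths through the code
    if x > threshold:
        return high
    return low


def pick_branch_free(x, threshold, low, high):
    # The comparison becomes a 1.0/0.0 mask; every input runs the same arithmetic
    mask = float(x > threshold)
    return mask * high + (1.0 - mask) * low


if __name__ == "__main__":
    for x in (0.0, 0.5, 1.0, 2.0):
        print(x, pick_branchy(x, 1.0, 10.0, 20.0), pick_branch_free(x, 1.0, 10.0, 20.0))
```

Both functions return the same values; the second form computes both alternatives and selects one, which only pays off when the alternatives are cheap.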

[/one_half]

[one_half title=”4. Low Data-transport overhead”]

In step 1 you looked for repeated computations. In this last step we look at the ratio between computations and data-size.

If the number of computations per data-chunk is high, then using the GPU is a good solution. A simple way to find out if a lot of computations are done is to look at CPU-usage in the system monitor. The reason this ratio matters is that data needs to be transferred to and from the GPU, which takes time even with 3 to 6 GB throughput per second.

When the number of computations per data-chunk is low, doubling of speed is still possible when OpenCL is used on CPUs. See the technical explanation of how OpenCL on modern CPUs works and can even outperform a GPU.

[/one_half]
[/columns]
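The trade-off in step 4 can be estimated on the back of an envelope. A minimal sketch, assuming the result is as large as the input, a bus throughput of 3 GB/s (the low end of the range mentioned above) and a guessed GPU speed-up factor; the function name and parameters are hypothetical:

```python
def gpu_worthwhile(data_bytes, cpu_seconds, gpu_speedup, bus_bytes_per_s=3e9):
    """Return (is_faster, estimated_gpu_seconds) for one batch of data.

    Assumptions: the data goes to the GPU and back once, and the kernel
    itself runs `gpu_speedup` times faster than the CPU version.
    """
    transfer = 2 * data_bytes / bus_bytes_per_s  # to the GPU and back
    gpu_total = cpu_seconds / gpu_speedup + transfer
    return gpu_total < cpu_seconds, gpu_total


if __name__ == "__main__":
    # 3 GB of data, heavy computation: the transfer is small compared to the gain
    print(gpu_worthwhile(3e9, cpu_seconds=10.0, gpu_speedup=10.0))
    # Same data, light computation: the transfer alone takes longer than the CPU run
    print(gpu_worthwhile(3e9, cpu_seconds=1.0, gpu_speedup=10.0))
```

With the same amount of data, only the computation-heavy case comes out faster on the GPU under these assumptions.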

Does it fit?

Found out OpenCL is right for you? Contact us immediately and we can discuss how we can make your software faster. Not sure? Request a code-review or Rapid OpenCL Assessment to quickly find out if it works.

Do you think OpenCL is not the solution, but are you still processing data at the limits of your system? Feel free to contact us, as we can give you feedback for free on how to solve your problem with other techniques.

AMD gDEBugger 6.2 for Linux

Posted by Vincent Hindriksen on 19 May 2012 with 1 Comment

The printf-function in kernels isn’t the solution to everything, hence the profilers and debuggers specially tailored for GPU-programming. On Windows there is a lot of choice, but mostly only if you have a paid version of Visual Studio. On Linux you have GDB, but that program is not really user-friendly for the GUI-lovers.

For AMD, gDEBugger is now available for Linux again. Again, because version 5.8 by Gremedy worked on Linux, but after AMD bought the company version 6 became Windows-only. A few weeks ago, 10 months after 6.0, the Linux-binaries came back with version 6.2. It supports OpenCL 1.2, OpenGL 3.2 and quite some extensions. As only AMD is supported, more on debugging OpenCL-applications on NVidia and Intel will follow later.

Installation is quite straightforward. For creating a menu-item, you’ll find a useful image in /opt/gDEBugger6.2.xxx/tutorial/images/.

Continue reading “AMD gDEBugger 6.2 for Linux” →

Installing both NVidia GTX and AMD Radeon on Linux for OpenCL

Posted by Vincent Hindriksen on 12 October 2011 with 6 Comments

August 2012: article has been completely rewritten and updated. For driver-specific issues, please refer to this article.

Want to have both your GTX and Radeon working as OpenCL-devices under Linux? The bad news is that all attempts to get the Radeon as a compute device and the GTX as primary failed. The good news is that the other way around works pretty easily (with some luck). You need to install both drivers and watch out that libglx.so isn’t overwritten by NVidia’s driver, as we won’t use that GPU for graphics – this is also the reason why it is impossible to use the second GPU for OpenGL.

Continue reading “Installing both NVidia GTX and AMD Radeon on Linux for OpenCL” →

PDFs of Monday 19 September

Posted by Vincent Hindriksen on 19 September 2011

Already the fourth PDF-Monday. It takes quite some time, so I might keep it to 10 in the future – but till then enjoy! Not sure which to read? Pick the first one (for the rest there is no order).

Edit: this is also the last one; follow me on Twitter to see the PDFs I’m reading. The reason is that hardly anyone clicked on the links to the PDFs.

I would like it if you let others know in the comments which PDF you liked a lot.

Adding Physics to Animated Characters with Oriented Particles (Matthias Müller and Nuttapong Chentanez). Discusses how to accelerate movements of pieces of cloth attached to the bodies. No time to read? There are nice pictures.

Jon Peddie’s analysis of the GPU market.

Hardware/Software Co-Design. Simple Solution to the Matrix Multiplication Problem using CUDA.

CUDA Based Algorithms for Simulating Cardiac Excitation Waves in a Rabbit Ventricle. Bioinformatics.

Real-time implementation of Bayesian models for multimodal perception using CUDA.

GPU performance prediction using parametrized models (Master-thesis by Andreas Resios)

A Parallel Ray Tracing Architecture Suitable for Application-Specific Hardware and GPGPU Implementations (Alexandre S. Nery, Nadia Nedjah, Felipe M.G. Franca, Lech Jozwiak)

Rapid Geocoding of Satellite SAR Images with Refined RPC Model. An ESA-presentation by Lu Zhang, Timo Balz and Mingsheng Liao.

A Parallel Algorithm for Flight Route Planning with CUDA (Master-thesis by Seçkîn Sanci). About the travelling salesman problem and much more.

Color-based High-Speed Recognition of Prints on Extruded Materials. Product-presentation on how to OCR printed text on cables.

Supplementary File of Sequence Homology Search using Fine-Grained Cycle Sharing of Idle GPUs (Fumihiko Ino, Yuma Munekawa, and Kenichi Hagihara). They sped up the BOINC-system (Folding@Home). It is a bit vague what they want to say, but maybe you find it interesting.

Parallel Position Weight Matrices Algorithms (Mathieu Giraud, Jean-Stéphane Varré). Bioinformatics, DNA.

GPU-based High Performance Wave Propagation Simulation of Ischemia in Anatomically Detailed Ventricle (Lei Zhang, Changqing Gai, Kuanquan Wang, Weigang Lu, Wangmeng Zuo). Computation in medicine. Ischemia is a restriction in blood supply, generally due to factors in the blood vessels, with resultant damage or dysfunction of tissue

Per-Face Texture Mapping for Realtime Rendering. A Siggraph2011 presentation by Disney and NVidia.

Introduction to Parallel Computing. The CUDA 101 by Victor Eijkhout of University of Texas.

Optimization on the Power Efficiency of GPU and Multicore Processing Element for SIMD Computing. Presentation on what you find out when putting the volt-meter directly on the GPU.

NUDA: Programming Graphics Processors with Extensible Languages. Presentation on NUDA to write less code for GPGPU.

Qt FRAMEWORK: An introduction to a cross platform application and user interface framework. Presentation on the Qt-platform – which has great #OpenCL-support.

Data Assimilation on future computer architectures. The problems projected for 2020.

Current Status of Standards for Augmented Reality (Christine Perey, Timo Engelke and Carl Reed). Not much to do with OpenCL, but it describes an interesting purpose for it.

Parallel Computations of Vortex Core Structures in Superconductors (Master-thesis by Niclas E. Wennerdal).

Program the SAME Here and Over There: Data Parallel Programming Models and Intel Many Integrated Core Architecture. Presentation on how to program the Intel MIC.

Large-Scale Chemical Informatics on GPUs (Imran S. Haque, Vijay S. Pande). Book-chapter on the design and optimization of GPU implementations of two popular chemical similarity techniques: Gaussian shape overlay (GSO) and LINGO.

WebGL, WebCL and Beyond! A presentation by Neil Trevett of NVidia/Khronos.

Biomanycores, open-source parallel code for many-core bioinformatics (Mathieu Giraud, Stéphane Janot, Jean-Frédéric Berthelot, Charles Delte, Laetitia Jourdan , Dominique Lavenier , Hélène Touzet, Jean-Stéphane Varré). A short description on the project http://www.biomanycores.org.

Aparapi: OpenCL in Java

Posted by Vincent Hindriksen on 3 August 2011 with 2 Comments

Edit: Aparapi has been open sourced and many issues have already been fixed and improved.

If you have an AMD GPU/APU, you should try Aparapi. This software lets you write OpenCL-code in Java at a pretty high level. The idea is, roughly, that it processes the Java intermediate code to search for loops and then creates optimised OpenCL-kernels. Just download Aparapi and try the two examples. As the current version is still in alpha, it is not flawless yet. What I think is important after having worked with Aparapi is that you learn how to keep it simple – like you know that you can gain most speed on straight roads and that turns slow you down.

The Aparapi-team tries to avoid explicit defining of local memory, but it is still possible by using the @Local annotation. Such decisions show the team wants Aparapi to be high-level. It also integrates well with JavaCL and JOCL, so you can mix in the kernels you have already created. You can also check out a video introducing Aparapi (it is video 15, if #-linking doesn’t work).

Time to create your own project. As not all errors are documented or are solved in the upcoming version, below you will find a list of common errors and how to easily solve them.

Continue reading “Aparapi: OpenCL in Java” →

DirectCompute’s unpopularity

Posted by Vincent Hindriksen on 28 December 2010

In the world of GPGPU we currently have 4 players: Khronos OpenCL, NVIDIA CUDA, Microsoft DirectCompute and PathScale ENZO. You probably know CUDA and OpenCL already (or start reading more articles from this blog). ENZO is a 64bit-compiler which serves a small niche-market, and DirectCompute is built on top of CUDA/OpenCL or at least uses the same drivers.

Edit 2011-01-03: I was contacted by PathScale about my conclusions about ENZO. The reason why not much is out there is that they’re still in closed alpha. Expect to hear more from them about ENZO somewhere in the coming 3 months.

A while ago there was an article introducing OpenCL by David Kanter, who claimed on page 4 that DirectCompute will win over CUDA. I quote:

Judging by history though, OpenCL and DirectCompute will eventually come to dominate the landscape, just as OpenGL and DirectX became the standards for graphics.

I tweeted that I totally disagreed with him, and in this article I will explain why.

Continue reading “DirectCompute’s unpopularity” →

http://www.flickr.com/photos/imabug/2946930401/

OpenCL Potentials: Medical Imaging

Posted by Vincent Hindriksen on 13 December 2010 with 2 Comments

If you have ever seen a CT or MRI scanner, you might have noticed the full-sized computer next to it (especially with the older ones). Quite some processing power is needed to keep up with the data-stream coming from the scanner, to process the data into a 3D-image and to visualise the data on a 2D-screen. Luckily we have OpenCL to make it even faster; which doctor doesn’t want real-time high-resolution results and which patient doesn’t want to see the results on an Apple iPad or Samsung Galaxy Tab?

Architects, bankers and doctors have one thing in common: they get a better feeling for the current subject if they can play with the data. OpenCL makes it possible to process data much faster and thus let the specialist play with it. The interesting part of IT is that it is in every domain now and therefore a new series: OpenCL-potentials.
