An example of real-world, end-user OpenCL usage

We ask all our customers whether we may share their story on our website. For competitive reasons, this is often not possible. The people at CEWE Stiftung & Co. KGaA kindly agreed to share the experience of a developer who took an OpenCL training with us and whose code we reviewed.

Enjoy his story, from the training until now!

This year, CEWE is planning to implement parts of the CEWE Photoworld in OpenCL. This software is used to create and purchase photo products such as the CEWE Photobook, CEWE Calendars, greeting cards and other products, and has an installed base of about 10 million throughout Europe. It is built with Qt and runs on Windows, Mac and Linux.

In the next version, CEWE plans to improve the speed of image effects such as the oil painting filter, to become more useful in the world of photo manipulation. Customers like to apply imaging effects to improve their photo products, to get more individual results, to fix accidentally out-of-focus photos, and so on.

One interesting effect is the so-called oil painting filter, which can be used to recover blurry pictures in a slightly artificial form. The filter has a parameter called the brush width, which the software exposes as a slider. If the effect is slow, the slider moves jerkily and you see an hourglass instead of a result. As you can imagine, trying different brush widths is no fun if each one takes more than a second to calculate, because you wait a long time before seeing any result. So this filter is a natural candidate for doing its number crunching on the GPU.

The current implementation of the oil painting function in the CEWE Photoworld uses the well-known ImageMagick library. It takes approx. 40 seconds to process a 20 MPix image with a brush width of 10 pixels on an i7 CPU. To be fair, I have to admit that the OpenMP option was switched off because it crashed on some systems. In the consumer market, stability is preferable to speed. The run time also grows quadratically with the brush size and linearly with the number of pixels. Please also take into account that consumer computers tend to have slower CPUs than the i7. To make a long story short: it is no fun.
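To see where the quadratic growth comes from, here is a minimal sketch of a straightforward oil painting filter in plain Python. It follows the commonly described intensity-histogram approach and is not ImageMagick's code. The names `oil_paint_naive`, `window_pixels`, `radius` and `levels` are assumptions made for illustration, as is treating the brush width as a window radius.

```python
def window_pixels(radius):
    """Number of neighbours visited per output pixel (away from the borders)."""
    return (2 * radius + 1) ** 2


def oil_paint_naive(image, radius, levels=20):
    """Naive oil painting filter.

    image: list of rows, each row a list of (r, g, b) ints in 0..255.
    For every pixel, the whole square window is scanned, its pixels are
    binned by intensity level, and the average colour of the most
    frequent level becomes the output colour.
    """
    height, width = len(image), len(image[0])
    out = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            count = [0] * levels
            sums = [[0, 0, 0] for _ in range(levels)]
            # Scan the full window, clipped at the image borders.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    r, g, b = image[ny][nx]
                    level = ((r + g + b) // 3) * (levels - 1) // 255
                    count[level] += 1
                    s = sums[level]
                    s[0] += r
                    s[1] += g
                    s[2] += b
            # Most frequent intensity level (lowest level wins a tie).
            best = max(range(levels), key=count.__getitem__)
            n = count[best]
            s = sums[best]
            out[y][x] = (s[0] // n, s[1] // n, s[2] // n)
    return out
```

With a radius of 10, `window_pixels(10)` gives 441 neighbour visits per output pixel, and doubling the radius roughly quadruples that number.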

But there are solutions to the problem. We started by googling how oil painting works and found a good tutorial at http://supercomputingblog.com/graphics/oil-painting-algorithm/. After some optimization, our own program took only 10 seconds: 4 times faster. Its run time also grew only linearly with the brush width. This implementation uses only one CPU core, so with some tricks we might get a bit faster still. But of course, project time is always pressing, and there are other things to do. The effect feels better to the user, but it is still no fun if you have to wait. Well, maybe you are more patient than I am, or your images are smaller. Then it might be okay. This implementation will be shipped in the upcoming winter version of the software. At this stage we have an optimized version of the algorithm on the CPU using C++ with Qt.
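The text does not say which optimization made the run time linear in the brush width. One well-known way to get that behaviour is a sliding-window histogram: when the window moves one pixel to the right, only the column that leaves and the column that enters are updated. The sketch below is an illustration under that assumption, not CEWE's C++ code, and the names `oil_paint_sliding`, `radius` and `levels` are hypothetical.

```python
def oil_paint_sliding(image, radius, levels=20):
    """Oil painting filter with a per-row sliding histogram.

    image: list of rows, each row a list of (r, g, b) ints in 0..255.
    Per output pixel, two window columns are touched instead of the
    whole window, so the work grows linearly with the radius.
    """
    height, width = len(image), len(image[0])
    out = [[None] * width for _ in range(height)]

    def update(count, sums, x, y0, y1, sign):
        # Add (sign=+1) or remove (sign=-1) one window column.
        for ny in range(y0, y1):
            r, g, b = image[ny][x]
            level = ((r + g + b) // 3) * (levels - 1) // 255
            count[level] += sign
            s = sums[level]
            s[0] += sign * r
            s[1] += sign * g
            s[2] += sign * b

    for y in range(height):
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        count = [0] * levels
        sums = [[0, 0, 0] for _ in range(levels)]
        # Build the window for x = 0: columns 0..radius, clipped.
        for nx in range(min(width, radius + 1)):
            update(count, sums, nx, y0, y1, +1)
        for x in range(width):
            if x > 0:
                leaving, entering = x - radius - 1, x + radius
                if leaving >= 0:
                    update(count, sums, leaving, y0, y1, -1)
                if entering < width:
                    update(count, sums, entering, y0, y1, +1)
            # Most frequent intensity level (lowest level wins a tie).
            best = max(range(levels), key=count.__getitem__)
            n = count[best]
            s = sums[best]
            out[y][x] = (s[0] // n, s[1] // n, s[2] // n)
    return out
```

Away from the borders this touches 2·(2r+1) pixels per output pixel instead of (2r+1)², which for r = 10 is 42 instead of 441, plus a scan over the `levels` bins.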

But there is room for improvement: two years ago, two developers took an OpenCL training at StreamHPC, and they wanted to put it to good use. At the "shipping days", two days in January where developers can choose their favorite project, the time had come to port the algorithm to the GPU. Just porting the code to OpenCL brought the execution time down to 1.5 seconds on an AMD Radeon HD 6750M (built into a 2011 MacBook Pro).

Some hints from StreamHPC sped up the program to 1.1 seconds on the same machine, or 0.1 seconds on an NVIDIA GTX 960 (a current consumer graphics card). At this point, further optimization makes less sense for CEWE, because other effects, like cross-fading from one image to the next, are set to take 400 ms and will mask the remaining delay. You get a smooth transition from one image to the next, and oil painting is fun now. CEWE is planning to roll out the feature in the new version of the CEWE Photoworld, scheduled for the Photokina fair 2016 this September.

Of course, there are some open issues to deal with:

Testing on different machines. At first, one kernel crashed on the NVIDIA GPU, and the program reported the failure poorly. The program runs on a wide variety of machines, and stability is the first priority. There are a lot of strangely configured computers in the world.

The work-group size is hard-coded to 64. This could be adapted better to the current machine. We plan to do this conservatively. Again: in the consumer market we have to be cautious.

Optimization of the use of the PCI bus. Copying a big picture is costly and should be avoided as much as possible. Right now the same picture is copied back and forth for every move of the slider (costing about 100 ms). Because there are normally many calculations on the same image, it should be copied to the GDDR memory once and kept there instead of being transferred again each time.

Porting other imaging effects to the GPU.
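Two of the open issues above, the fixed work-group size and the repeated image copy, can be sketched as host-side logic. This is plain Python for illustration only, not CEWE's code: `pick_work_group_size`, `padded_global_size`, `DeviceImageCache`, the `cap` of 256 and the `upload` callable are all hypothetical. In a real OpenCL host program, the two limits would come from `clGetKernelWorkGroupInfo` with `CL_KERNEL_WORK_GROUP_SIZE` and `CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE`.

```python
def pick_work_group_size(kernel_max, preferred_multiple, cap=256, fallback=64):
    """Conservative work-group size from values reported by the device.

    Falls back to the current fixed size when the reported values look
    invalid, and never exceeds the (assumed) cap.
    """
    if kernel_max < 1 or preferred_multiple < 1:
        return fallback
    limit = min(kernel_max, cap)
    if limit < preferred_multiple:
        return limit
    # Largest multiple of the preferred size that still fits.
    return (limit // preferred_multiple) * preferred_multiple


def padded_global_size(n_items, work_group_size):
    """Round the global size up to a multiple of the work-group size."""
    return -(-n_items // work_group_size) * work_group_size


class DeviceImageCache:
    """Upload an image once and reuse the device buffer for later calls."""

    _EMPTY = object()

    def __init__(self, upload):
        self._upload = upload      # hypothetical: copies pixels to the GPU
        self._key = self._EMPTY
        self._buffer = None

    def get(self, key, pixels):
        # Only pay for the transfer when a different image is requested.
        if key != self._key:
            self._buffer = self._upload(pixels)
            self._key = key
        return self._buffer
```

With such a cache, moving the slider on the same image would call `get` with the same key each time and trigger only one upload; the kernel then has to guard against the extra work items that the padded global size introduces.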

As this example shows, a bunch of rookies with one week of training can accomplish at least something within two days, and some more with advice from Mr. Hindriksen and his crew.

Thank you, StreamHPC!

StreamHPC communications