Khronos Invites Press & Game Developers to Sessions @ GDC San Francisco


Khronos just sent out the below message to press and game developers. To my understanding, many readers of this blog are game developers, so I'd like to share the message with you.

JOIN KHRONOS GROUP AT GDC 2014 SAN FRANCISCO
Press Conference, Technology Sessions and Refreshment Oasis

We invite you to attend one or more of the Khronos sessions taking place in the Khronos meeting room just off the Moscone show floor. For detailed information on each session, and to register, please visit: https://www.khronos.org/news/events/march-meetup-2014.
PRESS CONFERENCE

  • WHEN: Wednesday March 19 at 10:00 AM (Reception 9:30 AM)
  • WHERE: Room 262, West Mezzanine Level, (behind Official Press Room)
  • GUESTS: Members of the Press and Industry by Invitation*
  • RSVP: Jon Hirshon, Horizon PR jh@horizonpr.com

Members of the press are invited to attend the Khronos Press Conference, held jointly again this year with the PCGA (PC Gaming Alliance) consortium. Khronos will issue significant news on OpenGL ES, WebCL, OpenCL and several more Khronos technologies, and PCGA will issue news about 2013 gaming market numbers. Updates will be delivered by Khronos and PCGA executives, with insights from David Cole of DFC and Jon Peddie of Jon Peddie Research.

DEVELOPER SESSIONS

All GDC attendees** are invited to the Khronos Developer Sessions where experts from the Khronos Working Groups will deliver in-depth updates on the latest developments in graphics and media processing. These sessions are packed with information and provide a great opportunity to:

  • Hear about the latest updates from the gurus who invented these technologies
  • See leading-edge demos & applications
  • Put your questions to members of the Khronos working groups
  • Meet with other community members

SESSION SCHEDULE

Wednesday March 19

  • 3:00 – 4:00 : OpenCL & SPIR
  • 4:00 – 5:00 : OpenVX, Camera and StreamInput
  • 5:00 – 6:00 : OpenGL ES
  • 6:00 – 7:00 : OpenGL

Thursday March 20

  • 3:00 – 3:50 : WebCL
  • 4:00 – 4:50 : Collada and glTF
  • 5:00 – 7:00 : WebGL

SESSION REGISTRATION
For information and to register, visit: https://www.khronos.org/news/events/march-meetup-2014

REFRESHMENT OASIS

We thought “Refreshment Oasis” sounded like a nice way to say “sit down and have a cup of coffee while we keep working!” Khronos is happy to offer a hospitality suite conveniently located next to our primary meeting room (and the official GDC Press room) to showcase Khronos member technology demos, and to offer a place for GDC guests, Khronos members and marketing staff to meet. You are welcome to just drop by for a chat, or email Michelle@GoldStandardGroup.org to arrange a meeting with any Work Group Chairs, Khronos Execs or the Marketing Team.

We look forward to seeing you at the show!

*Admittance to the Press Conference is open to all GDC registered Press, and to members of industry on a “Seating Available” basis.  Space is limited so reserve your seat today.

** Admittance to the KHRONOS sessions is FREE but: (1) all attendees must have a GDC Exhibitor or Conference Pass to gain entry to the Khronos meeting room area (GDC tickets details http://www.gdconf.com) and (2) all attendees MUST REGISTER for the individual Khronos API sessions. We expect demand to be high and space is limited.

With open standards becoming more important in the very diverse computer-game industry, Khronos is also growing. If you are in this industry and want to know (or influence) the landscape for the coming years, you should attend.

Assessment of existing code-base’s quality

You found that the main computation takes over 90% of the processing time, or you found the framework to be slow in general. We got in contact and discussed speeding up your software, after which we shared this assessment with you. The assessment should give you insight into whether the software project is ready to be shared with Stream HPC.

The larger the code-base, the more important its code-quality

The first step is to prepare the code for porting or optimization. As it's not always easy to know what to do, we've defined 3 levels of code quality. The higher the quality level, the fewer obstacles for the project, the lower the costs, the less communication is required, and the fewer the frustrations.

Preparing a project for porting / optimizing

The sections below discuss the 3 levels at which a project can be. The goal is that you do a self-assessment and write down a detailed answer for each question, not only the final yes or no.

The code needs to reach the minimum levels: all of level 1 and the high level of level 2. It does not matter if the existing code is written in Matlab, Python, C, C++, Assembly, OpenCL, CUDA or any other language.

The action points of all levels need to be done. When a project is not ready yet, we can assist in improving code-quality, but assume a lot has to be done by your team. This will be a separate (pre-)project, and the full estimation for the porting/optimization can only be done after that.

If it is not possible to level up the software, or no source files are available, the work will be handled as a black-box or R&D project. Do know that such projects can never be done at a fixed price and are always unique. The generic part of the process is described in this blog – we are experienced in doing such focused R&D projects.

Level 1: Understandability

Goal: Can the software be understood without help from the main developers?

High level

  • Are the algorithms explained in, for example, scientific papers?
  • Do alternative implementations exist, e.g. in Python or Matlab?
  • Optional: is there a presentation/overview on the algorithm?

Mid level

  • Is there a software design document?
  • Is test-data provided that can be used to run the code?

Low level

  • Are all functions documented?
  • Is it clear what each (part of the) function is doing?
  • Are all variables documented?
  • Is it clear what each variable means?

Action points

  • A good communication plan to get questions answered quickly. This includes a direct contact and regular calls.
  • Walk the code and improve/update the existing in-code documentation. Often it has not been looked at since it was written.

Level 2: Testability

Goal: Are there few bugs and can new bugs be easily detected?

High level

  • Is there a gold standard, such as a proof-of-concept or the existing code? Is it available to us?
  • Are the outputs deterministic? Or can they be made deterministic quickly?
  • Is there a good understanding of where the algorithm ranges from less stable to unstable, and can this be explained?
  • Is there clarity on the required maximum quantitative errors that would define the correctness of output? Are there high level test-cases for all of these?
  • Is there clarity on required maximum qualitative errors that would define the correctness of output? Is the collection of examples and counter-examples large enough to define the error in full?
  • Can the whole library be tested using the sample input and output?
  • Is there clarity on the required precision? When is an answer correct? (See the sketch after this list.)
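
To make "correct" concrete, it helps to agree on a comparison function up front. Below is a minimal sketch in C of a typical acceptance check with absolute and relative tolerances; it is a generic illustration, not code from any specific project, and the tolerance values are placeholders.

#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when every ported value lies within atol + rtol*|ref|
   of the corresponding reference value. */
static bool outputs_match(const double *ref, const double *ported,
                          size_t n, double atol, double rtol)
{
    for (size_t i = 0; i < n; ++i) {
        double err = fabs(ref[i] - ported[i]);
        if (err > atol + rtol * fabs(ref[i]))
            return false;
    }
    return true;
}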

Mid Level

  • Does the CI contain static code analysis?
  • Are there test-cases or automated tools for finding qualitative errors?

Low level

  • Are the compute-intensive functions covered by functional tests?
  • Are the most important functions covered by functional tests?
  • Are the other functions covered by functional tests?

Action points

  • Make the code deterministic, by temporarily fixing or removing the random-number generator.
  • Define in detail what correct output is.
  • Complete the test cases for the different types of errors.
  • Create several sets of input data and their correct output data. These will be used for acceptance of the ported code, given the maximum errors allowed.
  • Decide which sub-results need to be compared.

Level 3: Quality

Goal: Is the software easy to maintain and extend?

High level

  • Is there build automation, like CMake or Meson?
  • Is the code multi-OS?
  • Are tests run with every commit, or at least daily/nightly?

Mid level

  • Are functions defined to only do one thing and do it well?
  • Has nobody on the team labelled the code as “spaghetti code”?
  • Are function names self-explanatory?
  • Are variable-names self-explanatory?

Low level

  • Is there no duplicated code?
  • Is there no code that could be made much less complex?
  • Are there no functions or methods longer than 100 lines, excluding documentation?
  • Are there no large classes?
  • Are there no functions or methods with far more than 7 parameters?
  • Is there only little commented-out code?
  • Is there only little dead code (code that is not called from anywhere)?
  • Is there no code that should be rewritten?
  • Are there no variables that are reused for different purposes?
  • Is there no significant dependence on global state?

Action points

  • Prepare the (new) GPU-code for continuous integration.
  • Document how the ported code maps to the existing code.

Costs Assessment

Code at level 0 can take 10 – 20 times more time than code at level 1. How much exactly is difficult to say, and that's exactly the reason we require a minimal level. The bitter pill is that the costs of not cleaning up the code are often even higher, due to increased hardware costs, increased maintenance costs and lack of innovation.

There is only one reason to keep code quality under level 1: when it’s going to be replaced within a month.

When level 1 is mostly done, each missing item can add 20% to 40% of extra costs. Good examples are not having good test cases, having no CPU code available, or not even having an executable. This adds costs in both the implementation phase and the acceptance phase. Making a new CPU implementation first is often cheaper.

From the described minimal level (level 1 in full, the high level of level 2) onwards, projects are more predictable and less costly. When getting a quote, you can request that the missing items be listed, and then choose to fix them yourself or let us do it.

Contact

Once you have done the assessment, get in touch to discuss your goals. We can guide you in prioritizing the work of making your code ready for porting.

Email info@streamhpc.com or look at the other options on the contact page. Then we'll schedule a call and discuss what you need.

If you find this self-assessment useful, know that it took us a long time to improve this document and that a lot of experience is hidden in it. But as we find quality software very important, we're releasing this list under CC-BY-NC-ND: it cannot be altered or used commercially, and it must have a clear reference to Stream HPC as the authors. In other words: it must be clear that you did not do the research or writing of this assessment yourself.

If you have feedback or suggestions, we’d really like to hear from you!

Improving FinanceBench

If you’re into computational finance, you might have heard of FinanceBench.

It's a benchmark developed at the University of Delaware, aimed at those who work with financial code and want to see how certain code paths can be targeted at accelerators. It utilizes the original QuantLib software framework and samples to port four existing applications for quantitative finance. It contains code for Black-Scholes, Monte-Carlo, Bonds and Repo financial applications, which can be run on the CPU and GPU.

The problem is that it has not been maintained for 5 years, and there were good improvement opportunities. Even though the paper was already written, we think it can still be of good use within computational finance. As we were seeking a way to make a demo for the financial industry that is not behind an NDA, this looked like the perfect starting point. We emailed all the authors of the library, but unfortunately did not get any reply. As the code is provided under a permissive license, we could luckily move forward.

The first version of the code will be released on GitHub early next month. Below we discuss some design choices and preliminary results.

Continue reading “Improving FinanceBench”

Porting code that uses random numbers


When we port software to the GPU or FPGA, testability is very important. Part of making code testable is getting its functionality fully under control, and as you have probably guessed, run-time generated random numbers need special attention.

In a selection of past projects, random numbers were generated on every run. Statistically the simulations were more correct, but it was impossible to make 100% sure the ported code was functionally correct. This is because two variations are introduced: one due to the numbers being different, and one due to differences in code and hardware.

Even if the combined error variations are within the given limits, the two code bases can have unnoticed, different functionality. On top of that, it is hard to keep further optimisations under control, as they can lower the precision.

When porting, the stochastic correctness of the simulations is less important. Predictable outcomes should be leading during the port.
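
A simple first step – shown here as a minimal, generic C sketch rather than code from any particular project – is to pin the generator to a fixed seed, so the reference run and the ported run consume the same sequence:

#include <stdlib.h>

/* Fixed seed: every run now produces the same "random" sequence,
   so reference and ported outputs can be compared one-to-one. */
void init_rng_for_porting(void)
{
    srand(12345);
}

Even better is to record the generated numbers once and replay them from a file, so the CPU and GPU versions consume identical inputs regardless of their RNG implementations.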

Below are some tips we gave to these customers; I hope they're useful for you too. If you have code to be ported, these preparations make the process quicker and more correct.

If you want to know more about the correctness of RNGs themselves, we discussed earlier this year that generating good random numbers on GPUs is not obvious.

Continue reading “Porting code that uses random numbers”

Scaling mobile GPUs to 1000 GFLOPS

On the 20th of April 2013 there was an interesting discussion between Jan Gray and David Kanter. Jan is a specialist in C++ and FPGAs (twitter, homepage). David is a specialist in CPU and GPU architectures (twitter, homepage). Both know their way around the field of semiconductors. It is always a joy to follow their short discussions when they happen, but there was something about this one that made me want to share it with special attention.

OpenCL on ARM: Growth-expectation of GFLOPS/Watt of mobile GPUs exceeds Moore’s law. That’s incredible!

Jan Gray: .@OpenCLonARM GFLOPS/W more a factor of almost-over Dennard Scaling. But plenty of waste still to quash. http://www.fpgacpu.org/papers/Gray_AutumnOfMooresLaw_SingularityUniversity_11-06-23.pdf

Jan Gray‏: .@openclonarm Scratch Dennard tweet: reduced capacitance of yet smaller devices shd improve GFLOPS/W even as we approach end of Vdd scaling.

David Kanter: @jangray @OpenCLonARM I think some companies would argue Vdd scaling isn’t dead…

Jan Gray: @TheKanter @openclonarm it’s not dead, but slowing, we’ve gone from 5V to 1V (25x power savings) and have maybe several hundred mVs to go.

David Kanter: @jangray I reckon we have at least 400mV, so ~2X; slower than ideal, but still significant

Jan Gray: @TheKanter We agree, I think.

David Kanter: @jangray I suspect that if GPU scaling > Moore’s Law then they are just spending more area or power; like discrete GPUs in the last decade

David Kanter: @jangray also, most positive comment I’ve heard from industry folks on mobile GPU software and drivers is “catastrophically terrible”

Jan Gray: @TheKanter Many ways to reduce power, soup to nuts. For ex HMC DRAM on interposer for lower energy signaling. I’m sure many tricks to come.

In a nutshell, these are all the reasons they think mobile GPUs can outpace Moore's law while staying under a certain power usage.

It needs some background info, so let's start with the background of the first tweet, and then explain what has been said. Continue reading “Scaling mobile GPUs to 1000 GFLOPS”

Thalesians talk – OpenCL in financial computations

At the end of October I gave a talk for the Thalesians, a group that organises different kinds of talks for people working in or interested in the financial markets. If you live in London, I would certainly recommend you visit one of their talks. But from a personal perspective I had a difficult task: how do you make a very diverse audience happy? The talks I gave in the past were for a more homogeneous and known public, and now I did not know at all what the attendees' level of OpenCL programming was. I chose to give an overview and reserve time for questions.

After starting with some honest remarks about my understanding of the British accent, and that I would kill my business by being honest with them, I spoke about 5 subjects. Some of them you might have read here, but not all. You can download the sheets [PDF] via this link: Vincent.Hindriksen.20101027-Thalesians. The text below is meant to make the sheets clearer, but is certainly not the complete talk. So if you have the feeling I skipped a lot of text, your feeling is right.

Continue reading “Thalesians talk – OpenCL in financial computations”

Visit us (Amsterdam)

So we invited you over? Cool! See you soon!

The Amsterdam Stream HPC offices are located on the sixth floor of Koningin Wilhelminaplein 1 in Amsterdam, which is at the Amsterdam West Poort business area. Below you’ll find information on how to get there.

The office building

When you arrive, ask at the desk to have somebody pick you up. If you want to test the office security, the unit is 6.01.

Getting to Koningin Wilhelminaplein 1

By Car

The office is located near the ring road A10, which makes the location easily accessible by car, via exit S107.

From the ring road A10 the complete Dutch motorway network is accessible. Taking the A10 to the South often results in a traffic jam though. See https://www.anwb.nl/verkeer for up-to-date traffic info.

Parking in the parking garage is only available if you let us know in advance! There is a ParkBee at 5 minutes walking distance – always more than enough space. It costs at most €10 per day when using the Yellowbrick app or when reserved via ParkBee, and about €20 per day when paid at the location. Please get clarity on who pays this, in advance.

Travel times by car, outside rush hours:

  • Office – Schiphol: 15 minutes
  • Office – The Hague: 40 minutes
  • Office – Utrecht: 35 minutes
  • Office – Rotterdam: 50 minutes

By Public transport

The office is a 5-minute walk from Amsterdam Lelylaan station. See further below for the walking route.

View in the direction of the office from the metro station

In Amsterdam, Lelylaan station is a medium-sized public transport hub. It should be easy to get here from any big city or any address in Amsterdam, as many fast trains also stop here.

  • Trains to the North: Amsterdam Central, Haarlem, North and East of the Netherlands
  • Trains to the South: Schiphol, Amsterdam Zuid, Amsterdam RAI, Utrecht, Eindhoven, Leiden and Rotterdam
  • Bus: Lines 62 (Amstel), 63 (Osdorp), 195 (Schiphol).
  • Metro: Line 50, connecting to the Amsterdam train stations Sloterdijk, Lelylaan, Zuid, RAI and Bullewijk. In case of problems with the train to Lelylaan/Sloterdijk or Schiphol, one option is to go to Amsterdam Zuid and take the metro or train from there. Line 51 connects to the Vrije Universiteit in Amsterdam Zuid.
  • Tram: Lines 1 (Osdorp – Muiderpoort) and 17 (Osdorp – Central station).

See https://9292.nl/station-amsterdam-lelylaan for all time tables and planning trips.

Walking from the train/metro station

Remember that in the Netherlands, crossing car lanes is relatively safer than crossing bike lanes, contrary to traffic in other countries. In Dutch cities cars brake when you cross the street, while bikes simply don't. No joke. So be sure not to walk on the red bike lanes unless really necessary.

When leaving the train station, make sure you take the Schipluidenlaan exit towards the south (to the right, when you see the view as on the image). This is where the buses are, not the trams. If you are at the tram area (between two car roads), go back into the station.

From the bus stop, go to the roundabout to the west. Walk the whole street to the next roundabout, where you will see the shiny office building on your right.

By Taxi

In Amsterdam you can order a taxi via +31-20-6777777 (+31-20-6, then six times 7). Expect a minimum charge of €20.

At Schiphol Airport there are official taxi stands – it takes 15-25 minutes to get to Lelylaan outside rush hours. Make sure to mention the roundabout reconstruction, to prevent a 10-minute longer drive.

Bicycle

For biking, use https://www.route.nl/routeplanner and set “Rembrandtpark” as the end-point for the better/nicer/faster routes. From the park it's very quick to get to the office – use a normal maps app for the final stretch.

Starting with GROMACS and OpenCL

Now that GROMACS has been ported to OpenCL, we would like you to help us make it better. Why? It is very important we get more projects ported to OpenCL, to gain more critical mass. If we only used our spare resources, we could port one project per year. So the deal is that we do the heavy lifting and, with your help, get all the last issues covered. Understand that we did the port using our own resources, as everybody was waiting for others to take a big step forward.

The steps below will take no more than 30 minutes.

Getting the sources

All sources are available on GitHub (our working branch, based on GROMACS 5.0). If you want to help, check out via git (on the command line; via Visual Studio, where git support is included in 2013 and available for 2010 and 2012 via a plugin; via Eclipse; or via your preferred IDE), or simply download the zip-file. Note there is also a wiki, where most of this text came from. Especially check the “known limitations”. To check out via git, use:

git clone git@github.com:StreamHPC/gromacs.git

Building

You need a fully working build environment (GCC, Visual Studio) and an OpenCL SDK installed. You also need FFTW. The GROMACS build can build it for you, but it is also in the Linux repositories, and for Windows it can be downloaded here. Below is the Linux variant, without your own FFTW installed (read on for more options and explanation):

mkdir build
cd build
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DCMAKE_BUILD_TYPE=Release

There are several other build options. You don't need them, but they give an idea of what is possible:

  • -DCMAKE_C_COMPILER=xxx equal to the name of the C99 compiler you wish to use (or the environment variable CC)
  • -DCMAKE_CXX_COMPILER=xxx equal to the name of the C++98 compiler you wish to use (or the environment variable CXX)
  • -DGMX_MPI=on to build using an MPI wrapper compiler. Needed for multi-GPU.
  • -DGMX_SIMD=xxx to specify the level of SIMD support of the node on which mdrun will run
  • -DGMX_BUILD_MDRUN_ONLY=on to build only the mdrun binary, e.g. for compute cluster back-end nodes
  • -DGMX_DOUBLE=on to run GROMACS in double precision (slower, and not normally useful)
  • -DCMAKE_PREFIX_PATH=xxx to add a non-standard location for CMake to search for libraries
  • -DCMAKE_INSTALL_PREFIX=xxx to install GROMACS to a non-standard location (default /usr/local/gromacs)
  • -DBUILD_SHARED_LIBS=off to turn off the building of shared libraries
  • -DGMX_FFT_LIBRARY=xxx to select whether to use fftw, mkl or fftpack libraries for FFT support
  • -DCMAKE_BUILD_TYPE=Debug to build GROMACS in debug mode

It’s very important you use the options GMX_GPU and GMX_USE_OPENCL.

If the OpenCL files cannot be found, you could try to specify them (and let us know, so we can fix this), for example:

cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DCMAKE_BUILD_TYPE=Release \
  -DOPENCL_INCLUDE_DIR=/usr/include/CL/ -DOPENCL_LIBRARY=/usr/lib/libOpenCL.so

Then run make and optionally check the installation (success currently not guaranteed). For make you can use the option “-j X” to launch X threads. Below is with 4 threads (for a 4-core CPU):

make -j 4

If you only want to experiment, and not code, you can install it system-wide:

sudo make install
source /usr/local/gromacs/bin/GMXRC

In case you want to uninstall, that’s easy. Run this from the build-directory:

sudo make uninstall

Building on Windows, special settings and problem solving

See this article on the GROMACS website. In all cases, it is very important you turn on GMX_GPU and GMX_USE_OPENCL. The wiki of the GROMACS OpenCL project also has lots of extra information. Be sure to check these if you want to do more than just the benchmarks below.

Run & Benchmark

Let’s torture GPUs! You need to do a few preparations first.

Preparations

GROMACS needs to know where to find the OpenCL kernels, on both Linux and Windows. Under Linux, type: export GMX_OCL_FILE_PATH=/path-to-gromacs/src/. For Windows, define the GMX_OCL_FILE_PATH environment variable and set its value to /path_to_gromacs/src/

Important: if you plan to make changes to the kernels, you need to disable caching in order to be sure you will be using the modified kernels: set GMX_OCL_NOGENCACHE, and for NVIDIA also CUDA_CACHE_DISABLE:

export GMX_OCL_NOGENCACHE
export CUDA_CACHE_DISABLE

Simple benchmark, CPU-limited (d.poly-ch2)

Then download the archive “gmxbench-3.0.tar.gz” from ftp://ftp.gromacs.org/pub/benchmarks. Unpack it in the build/bin folder. If you have installed GROMACS machine-wide, you can pick any directory you want. You are now ready to run from /path-to-gromacs/build/bin/:

cd d.poly-ch2
../gmx grompp
../gmx mdrun

You have now just run GROMACS and got results like:

Writing final coordinates.

           Core t (s)   Wall t (s)      (%)
 Time:        602.616      326.506    184.6
             (ns/day)   (hour/ns)
Performance:    1.323      18.136

Get impressed by the GPU (adh_cubic_vsites)

This experiment is called “NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water”. Download “ADH_bench_systems.tar.gz” from ftp://ftp.gromacs.org/pub/benchmarks. Unpack it in build/bin.

cd adh_cubic_vsites
../gmx grompp -f pme_verlet_vsites.mdp
../gmx mdrun

If you want to run on the first GPU only, add “-gpu_id 0” as a parameter of mdrun. This is handy if you want to benchmark a specific GPU.

What’s next to do?

If you have your own experiments, of course test them on your AMD devices. Let us know how they perform on “adh_cubic_vsites”! Understand that GROMACS was optimised for NVidia hardware, and we needed to reverse a lot of NVidia-specific optimisations to get good performance on AMD.

We welcome you to solve or report an issue. We are now working on optimisations, which are the most interesting tasks of a porting job. All feedback and help is really appreciated. Do you have any questions? Just ask them in the comments below, and we'll help you on your way.


Birthday present! Free 1-day Online GPGPU crash course: CUDA / HIP / OpenCL

Stream HPC turns 10 years old on 1 April 2020. Therefore we are offering our one-day GPGPU crash course for free that whole month.

Now that Corona (and the fear of it) is spreading, we had to rethink how to celebrate 10 years. So while there were different plans, we simply had to adapt to the market and world dynamics.

5 years ago…
Continue reading “Birthday present! Free 1-day Online GPGPU crash course: CUDA / HIP / OpenCL”

What does it mean to work at Stream HPC?

High-performance computing on many-core environments and low-level optimizations are very important concepts in large scientific projects nowadays. Stream HPC is one of the market's more prominent companies, active mostly in North America and Europe.

As we often get asked what it is like to work at the company, we'd like to give you a little peek into our kitchen.

What we find important

We're a close-knit group of motivated individuals who get a kick out of performance optimizations and are experienced in programming GPUs. Every day we have discussions on performance: finding out why certain hardware behaves in a certain manner when a specific computing load is applied, why certain code is not as fast as theoretically promised, and then finding the bottlenecks by analyzing the device and finding solutions for removing them. As a team we make better code than we ever could as individuals.

Quality is important for everybody on the team, which goes a whole step further than “just getting the job done”. This has a simple reason: we cannot speed up code that is of low quality. This is also why we don't use many tools that automatically do magic, as these often miss significant improvements and don't improve the code quality. We don't expect AI to fully replace us soon, but once it's possible we'll probably be part of that project ourselves.

Computer science in general is evolving at a fast rate, and therefore learning is an important part of the job. Reading papers, finding new articles, discussing future hardware architectures and how they would affect performance: all of it is very important. With every project we gather as much data as possible from scientific publications, interesting blog posts and code repositories, in order to be on the bleeding edge of technology for our project. Why use a hammer to speed up code, when you don't know which hammer to use best?

Our team-culture

Personality of the team

We are all kind, focused on structured problem-solving, communicative about wins and struggles, focused on group wins above personal gains, and we are all gamers. To have good discussions and good disagreements, we seek people who are also open-minded.

And we share and appreciate humor!

Tailored work environment

We have all kinds of people on the team, who need different ways of recharging: one needs a walk, while somebody else needs a quiet place. We help each other with more than just work-related obstacles. We think that a broad approach to differences helps us understand how to progress to the next professional level the quickest. This is inclusivity-in-action we're proud of. Oh, and we have noise-canceling headphones.

Creating a safe place to speak up is critical for us. It helps us learn new skills and do things we have never done before. And this approach also works well for all those who don't have Asperger's or ADHD at all, but need to progress without first fitting a certain norm.

Projects we do

Today we work on plenty of exciting projects and no year has been the same. Below is a page with projects we’re proud of.

https://streamhpc.com/about-us/work-we-do

Style of project handling

We use GitLab and Mattermost to share code and have discussions. This makes it possible to keep good track of each project – searching for what somebody said or coded two years ago is quite easy. Using modern tools has changed the way we work a lot, and thus we have questioned and optimized everything that was presented as “good practice”. Most notable are the management and documentation styles.

Saying that an engineer hates documentation and being managed because he/she is lazy is simply false. It's because most management and documentation styles are far from optimal.

Pull-style management means the tasks are written down by the team, based on the proposal. All these tasks are put on the task list of the project, and each team member then picks the tasks that are a good fit. The last resort for tasks that stay behind and have a deadline (being pushed) was only needed in a few cases.

All code (in merge requests) is checked by one or two colleagues, chosen by the one who wrote the code. More important are the discussions in advance, as the group can give more insight than any individual, and one can get into the task well-prepared. The goal is not just to get the job finished, but to avoid having written the code in which a future bug will be found.

All types of code can contain comments, and Doxygen can create documentation automatically, so there is no need to copy functions into a Word document. Log-style documentation was introduced because git history and Doxygen don't answer why a certain decision was made. With such a logbook, a new member of the team can just read these remarks and fully understand why the architecture is how it is and what the limits are. We'll discuss this in more detail later.

These types of solutions describe how we work and how we differ from a corporate environment: no-nonsense and effective.

Where do we fit in your career?

Each job should move you forward, when taken at the right moment. The question is when Stream HPC is the right choice.

As you might have seen, we don’t require a certain education. This is because a career is a sum, and an academic study can be replaced by various types of experience. The optimum is often both a study and the right type of experience. This means that for us, a senior can be a student and a junior can have been 20 years in the field.

So what is the “right type of experience”? Let's talk about those who only have job experience with CPUs. First, being hooked on performance as a primary interest; this is the main reason to get into HPC and GPGPU. Second, being good at C and C++ programming. Third, knowing algorithms and mathematics really well and being able to apply them quickly. Fourth, being a curious and quick learner, which shows in having experimented with GPUs. This is also exactly what we test and check during the application procedure.

On the job you'll learn everything around GPU programming, with a balance between theory and practice. Preparation is key in how we work, and this is something you will develop in many circumstances.

Those who left Stream HPC have gone on to very senior roles, from team lead to CTO. With Stream HPC growing in size, the growth opportunities within the company are also increasing.

Make the decision for a new job

Would you like to work for a rapidly growing company of motivated GPU professionals in Europe? We seek motivated, curious, friendly people. If you liked what you read here, do check our open job positions.

OpenCL tutorial videos from Mac Research

A while ago macresearch.com ceased to exist, as David Gohara pulled the plug. Luckily the sources of a very nice tutorial were not lost, and David gave us permission to share his material.

Even if you don't have a Mac, these almost 5-year-old materials are very helpful for understanding the basics (and more) of OpenCL.

We also have the sources (chapter 4, chapter 6) and the collection of corresponding PDFs for you. All material is copyright David Gohara. If you like his style, also check out his podcasts.

Introduction to OpenCL

http://www.youtube.com/watch?v=oc1-y1V1TPQ

OpenCL fundamentals

http://www.youtube.com/watch?v=FrLqSgYyLQI

Building an OpenCL Project

http://www.youtube.com/watch?v=K7QiD74kMvU

Memory layout and Access

http://www.youtube.com/watch?v=oPE3ypaIEv4

Questions and Answers

http://www.youtube.com/watch?v=9rA6DypMsCU

Shared Memory Kernel Optimisation

http://www.youtube.com/watch?v=oFMPWuMso3Y

Did you like it? Do you have improvements on the code? Want us to share more material? Let us know in the comments, or contact us directly.

Want to learn more? Look in our knowledge base, or follow one of our trainings.


ImageJ and OpenCL

For a customer I'm writing a plugin for ImageJ, a toolkit for image processing and analysis in Java. Rick Lentz has written an OpenCL plugin using JOCL. In the tutorial, step 1 is installing the great OS Ubuntu, but that would not be the fastest way to get going, and since JOCL is multi-platform this step should be skippable. Furthermore, I rewrote most of the code, so it is a little more convenient to use.

In this blog-post I’ll explain how to get it up and running within 10 minutes with the provided information.

Continue reading “ImageJ and OpenCL”

OpenCL 2.0 book on Indiegogo

Edit: the project unfortunately did not get enough funding on Indiegogo.

Launching a book takes a lot of effort. By using crowd-funding, we hope to get the book published much earlier and at a lower price.

Pre-order via Indiegogo – only in August 2013: http://igg.me/at/opencl20manual

What you’ll get

You will get the first OpenCL 2.0 book on the market, fully updated with the latest function references and power tips. It is also usable for OpenCL 1.1/1.2, helping you write backward-compatible software.

Reference pages for quick access to all OpenCL functions – available online and offline. This has nothing to do with the Khronos reference pages for OpenCL 2.0, as this is a complete rewrite and redesign of the description of each function definition.

Reference pages of functions

A lot of energy goes into completely revising the original OpenCL reference pages, to create real value for you. This is not just a small upgrade, but an alternative (and more complete) explanation of all the functions. Expect it to contain twice as much information.

Each function will be explained in clear language, with a full explanation of the background knowledge and an example. If the function can be used in more contexts, more examples are given.

At one glance you can see what is new per OpenCL version. Also all functions are extensively tagged and grouped, so you can easily find similar functions.

Basic concepts and programming theories

Various new additions to the series of basic concepts and the series on programming theories will only be available in the book, not on the blog. These chapters will help you connect the dots and get a better overview of how OpenCL works.

This content is unique and not found anywhere else. It has its foundation in hundreds of articles and research papers, combined with the years of experience in the field as a developer and a trainer.

Hardware and Optimisation guide

An explanation of all OpenCL optimisation techniques, including a guide on how to use auto-tuning to find the best configuration for each optimisation.

How well does each optimisation work on the various architectures? The results of mini-benchmarks will give you a complete overview of what helps and what doesn't.

Tools & software

There are various tools out there – both open source and commercial – that make it easier to program more efficiently and faster. The top 10 best OpenCL tools are described, including software not discussed online before.

For all contributors

Reference pages

You get access to the reference pages while I work on them. When finished, you also get a zip-file with HTML files for the times you don't have internet access. You will get updates for all 2.0 revisions. You can give feedback at any time, and with this you have influence on the direction the manual is going.

E-book

At all times you get a progress report with a TOC. When finished, you'll get the book sent as a PDF. After some time for feedback, you'll receive a new version. People who bought the print will receive it with the second version.

All prices include Dutch VAT.

Ways You Can Help

Have you supported this project? Thank you very much for your support!

Please also tell your friends and colleagues, and spread the word on Twitter, Facebook and LinkedIn!


Handling OpenCL with CMake 3.1 and higher

There has been quite some “find OpenCL” code for CMake around. If you haven't heard of CMake: it's the most useful cross-platform tool for building cross-platform software.

Put this into CMakeLists.txt, changing the names for the executable.

#Minimal OpenCL CMakeLists.txt by StreamHPC

cmake_minimum_required (VERSION 3.1)

project(GreatProject)

# Handle OpenCL
find_package(OpenCL REQUIRED)
include_directories(${OpenCL_INCLUDE_DIRS})

add_executable (main main.cpp)
target_include_directories (main PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})
# Link against the OpenCL library that find_package located
target_link_libraries (main ${OpenCL_LIBRARIES})

Then do the usual:

  • make a build-directory
  • cd build
  • cmake .. (specifying the right Generator)

Add your own CMake snippets and you're one happy dev!
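
To smoke-test the build, a minimal main.cpp like the sketch below is enough; it is plain C-style OpenCL that only counts the available platforms (my addition, not part of the original snippet):

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_uint num_platforms = 0;

    /* Query how many OpenCL platforms are installed. */
    cl_int err = clGetPlatformIDs(0, NULL, &num_platforms);
    if (err != CL_SUCCESS) {
        printf("clGetPlatformIDs failed: %d\n", err);
        return 1;
    }
    printf("Found %u OpenCL platform(s)\n", num_platforms);
    return 0;
}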

CMake 3.7

CMake 3.7 makes it even easier! You can do the following:

find_package(OpenCL REQUIRED)
add_executable(test_tgt main.c)
target_link_libraries(test_tgt OpenCL::OpenCL)

This automatically sets up the include paths and target library to link against. No need to use the ${OpenCL_INCLUDE_DIRS} and ${OpenCL_LIBRARIES} any more.

(Thanks Matthäus G. Chajdas for improving this!)

Getting CMake 3.1 or higher

  • Ubuntu/Debian: Get the PPA.
  • Other Linux: Get the latest tar.gz and compile.
  • Windows/OSX: Download the latest exe/dmg from the CMake homepage.

If you have more tips to share, put them in the comments.

OpenCL Developer support by NVIDIA, AMD and Intel

There was a guy at Microsoft who understood IT very well while being a businessman: “Developers, developers, developers, developers!”. You saw it again in the mobile market and now with OpenCL. Normally I watch his yearly speech to see which products they have brought into their own ecosphere, but the developers speech is one to watch over and over, because he is so right about this! (I don't recommend the house remixes, because those stick in your head for weeks.)

Since OpenCL needs to be optimised for each platform, it is important for these companies that developers start developing for their platform first. StreamComputing is developing a few different Eclipse plugins for OpenCL development, so we were curious what was already out there. Why not share all findings with you? I will keep this article updated – note that this article does not cover which features are supported by each SDK.

Continue reading “OpenCL Developer support by NVIDIA, AMD and Intel”

Building a 150 TFLOPS cluster with Accelerators in 2014

You can't ignore accelerators when designing a new cluster for HPC anymore. Back in 2010 I suggested using GPUs to enter the Top 500 with a budget of only €38k. It takes ten times more now, as almost everybody has started to use accelerators. Getting into the November Top 500 would roughly take a cluster of 150 TFLOPS.

I'd like to give you a list of what you can expect for 2014, to help you design your HPC cluster with recent hardware. The focus should be on OpenCL-capable hardware, as open standards prepare you better for future upgrades. So, this is also a guess at what we will see in the November Top 500, based on current information.

There are currently professional solutions from NVIDIA, AMD, Intel and Altera. I've searched the web and asked around for the upcoming offers. You will find the results below. But information should continue to flow: please add your remarks in the comments, so we get the best information through collaboration.

Comparison: only the double-precision (DP) GFLOPS of the accelerators are mentioned. The theoretical GFLOPS cannot be reached in real-world benchmarks. Therefore, DGEMM is used as an indication of the maximum realistic GFLOPS. The efficiencies of other benchmarks (like Linpack) are all lower.

NVIDIA Tesla

NVIDIA Tesla is the current market leader with the Tesla K20 and K20X. By the end of 2013 they announced the K40 (GK110b architecture), which is 10% to 20% faster than the K20X (see table): 10% faster in max GFLOPS, and up to another 10% due to architecture improvements. It's not a huge difference, but the new Maxwell architecture is more promising. The problem is that high-end Maxwell is not expected this year. There are several rumours about what's going on, but the official one is that there are problems with 20nm. I've had this confirmed by different sources, and will, of course, keep you up-to-date on Twitter.

I could not find good enough information on the K40X. It has also been very quiet around the current architectures at their yearly GTC conference. My expectation is that they want to kick in hard with Maxwell in 2015, and that for 2014 they'll focus on keeping their current customers happy in a different way. For now, let's assume the K40X is 10% faster.

So, for this year it will be the K40. Here's an overview:

  • Peak 1.43 DP TFLOPS theoretical
  • Peak 1.33 DP TFLOPS DGEMM (93% efficiency)
  • 5.65 GFLOPS/Watt DGEMM
  • Needs 122 GPUs to get 150 TFLOPS DGEMM
  • Lowest street price is $4800; $585,600 for 122 GPUs.

AMD FirePro

Just like the Tesla K40 and the Intel Xeon Phi, AMD offers accelerators with a lot of memory. The S10000 and S9000 are their current server offers, but these are still based on their older architectures. Their latest architecture is only available for gamers (i.e. the R9 290X) and workstations (i.e. the W9100). With the recent announcement of the W9100, we now have an indication of what this server accelerator would cost and look like. I expect the card to launch soon; I even expected it to be launched before the W9100.

What is interesting about the W9100 is the high memory transfer rate and the large memory. Assuming they need to pack the S9150 into 225 Watt and won't change the design much in order to launch soon, they need to under-clock it by about 22%. I think they could use 235 Watt (like the K40), but I want to be realistic.

|                 | FirePro W9100 | FirePro W9000 | FirePro S9150 |
|-----------------|---------------|---------------|---------------|
| Shader count    | 2816          | 2048          | 2816          |
| Memory size     | 16 GByte      | 6 GByte       | 16 GByte      |
| Memory type     | GDDR5         | GDDR5         | GDDR5         |
| Interface       | 512-bit       | 384-bit       | 512-bit       |
| Transfer rate   | 320 GByte/s   | 264 GByte/s   | 320 GByte/s   |
| TDP             | 275 Watt      | 274 Watt      | 225 Watt (-22%) |
| Connectors      | 6 × MiniDP, 3D-Stereo, Frame-/Genlock | 6 × MiniDP, 3D-Stereo, Frame-/Genlock | ? |
| Multi-monitor   | yes (6)       | yes (6)       | Don't care    |
| SP/DP (TFLOPS)  | 5.24 / 2.62   | 3.99 / 1.0    | 4.1 / 2.0 (-22%) |
| ECC             | yes           | yes           | yes           |
| OpenCL 2.0      | yes           | no            | yes           |
| Price           | $3999         | $2999         | ?             |

So, what about the successor to the FirePro S9000 with the latest GCN architecture, the S9150? An overview:

  • Peak 2.0 DP TFLOPS theoretical
  • Peak 1.6 DP TFLOPS DGEMM (at 80% efficiency, to be safe)
  • 7.1 GFLOPS/Watt DGEMM
  • Needs 94 GPUs to get 150 TFLOPS DGEMM
  • No prices available yet – AMD mostly prices lower than NVIDIA. At $3999 a piece, 94 GPUs would cost $375,906.

Update: a DGEMM efficiency of 90% has been reached. That gives 1.8 DP TFLOPS DGEMM and 8.3 GFLOPS/Watt DGEMM. As a result, you need only 84 GPUs to get to the 150 TFLOPS.

Intel Xeon Phi

Intel's Xeon Phi line started with the 3110, 5110 and 7110. In the past months they added the 3120, 5120 and 7120. The 7120 uses 300 Watt, which needs special casing to cool this passively cooled card; I don't quite understand this choice. I could compare it better to the W9100 and a heavily overclocked K40, or use lower numbers like I did above with the FirePro. But, as you can see, it doesn't even compare well at 300 Watt.

The OpenCL drivers have been improved this year, which is promising news. The guess here is whether they will launch a new 7130, a 7200, or none at all. All the news and rumours speak of 2015 and 2016 for more integrated memory and a socket version(!) of the Xeon Phi.

For this year the Xeon Phi 7120 will be their top offer. It compares well with AMD's W9100 when it comes to memory: 16GB GDDR5 and 352 GB/s.

  • Peak 1.21 DP TFLOPS theoretical
  • Peak 1.07 DP TFLOPS DGEMM (at 80% efficiency)
  • 3.56 GFLOPS/Watt DGEMM
  • Needs 140 Phis to get 150 TFLOPS DGEMM
  • Costs $4129 officially, $578,060 for 140.

Altera FPGAs

With OpenCL it finally became possible to run SIMD-focused software on FPGAs. OpenCL 2.0 also has some improvements for FPGAs, making them interesting for mature software that needs low latency or less power usage. In other words: software that has been designed on GPUs, where measurements show that lower latency would out-compete others on the market who use GPUs, or where the electricity bill makes the CFO sad. Understand that FPGAs do compete with the above three, but they have their own performance hot-spots and are therefore hard to compare.

I don't expect a big entry in this year's Top 500, but I'm watching FPGA progress closely. Xilinx is also entering this market, but I don't get much response (if any) to the emails I send them. For next year's article I hope to include FPGAs as a true competitor. If you need low power or low latency, you'd better take the time this year to research the FPGA potential for your business.

Conclusion

Open standards

For those who don't know, I tend to prefer open standards. The main reason is that switching hardware is easier; it gives you space to experiment. AMD, Intel and Altera support OpenCL 1.2 and will start with 2.0 later this year, whereas NVIDIA lags over 2 years behind and only supports OpenCL 1.1. The results are now very visible: due to the problems with Maxwell, you'll need to postpone your plans to 2015 if you code in CUDA. There is one way to pressure them, though: port your code to OpenCL, buy Intel or AMD hardware, and then let NVidia know you want this flexibility.

Green 500

You might have noticed the big differences in GFLOPS/Watt. Where this matters is in the Green 500, the list of energy-efficient supercomputers. The goal of today's supercomputers is to be mentioned in the top 10 of both lists. If you build an efficient cluster (say 2 CPUs + 4 GPUs), you can get to 70-80% of max DGEMM performance. Below is the list at 75%:

  • AMD FirePro – 7.10 GFLOPS/Watt DGEMM -> 5.33 GFLOPS/Watt @ 75%
  • NVIDIA Tesla – 5.65 GFLOPS/Watt DGEMM -> 4.24 GFLOPS/Watt @ 75%
  • Intel XeonPhi – 3.56 GFLOPS/Watt DGEMM -> 2.67 GFLOPS/Watt @ 75%

Currently this list is led by a cluster with K20X GPUs, steaming out 4.50 GFLOPS/Watt, which is even at 86% of max DGEMM.

In other words: if the FirePro gets out in time, the Green 500 could be full of FirePro GPUs.

Update November 2014: here is the Green top 5.

Green500 with AMD FirePro S9150 at spot #1

The winner

Since there are only three offers, they are all winners. What matters is the order.

  1. AMD FirePro – 16GB with its fast memory; the clear winner in DGEMM performance. The negative side: CUDA software needs to be ported to OpenCL (we can do that for you).
  2. NVIDIA Tesla – Second to FirePro in everything (bandwidth, memory size, GFLOPS, price). The negative side: its OpenCL support is outdated.
  3. Intel Xeon Phi – The same as FirePro when it comes to memory. Nevertheless, it's 60% slower in DGEMM and 50% less efficient. The negative side: 300 Watt for a server.

I am happy to see AMD as a clear winner after years of NVIDIA leading the pack. As AMD is the most prominent supporter of OpenCL, this could seriously democratise HPC in times to come.


Need to port CUDA to extremely fast OpenCL? Hire us!

If you order a cluster from AMD instead of NVIDIA, you effectively get our services for free.


Neil Trevett on OpenCL

The Khronos Group gave some talks on their technologies in Shanghai, China, on the 17th of March 2012. Neil Trevett made some interesting remarks on NVidia's position on OpenCL that I would like to share with you. Neil Trevett is both an important member of Khronos and an employee of NVidia. To be more precise, he is the Vice President Mobile Content at NVidia and the president of Khronos. I think we can take his comments seriously, but we must be careful, as they are mixed with his personal opinions.

Regular readers of the blog have seen that I am not enthusiastic at all about NVidia's marketing, but that I am a big fan of their hardware. And I am very positive they are bold enough to position themselves well in the fast-changing markets of the upcoming years. Having said that, let's go to the quotes.

All quotes are from this video. The best part starts at 41:50 and runs till 45:35.

http://www.youtube.com/watch?v=_l4QemeMSwQ

At 44:05 he states: “In the mobile space I think CUDA is unlikely to be widely adopted”, and explains: “A [proprietary] API in the mobile industry doesn't really meet market needs”. Then he continues with his vision on OpenCL: “I think OpenCL in the mobile is going to be fundamental to bring parallel computation to mobile devices” and then “and into the web through WebCL”.

Also interesting, at 44:55: “In the end NVidia doesn't really mind which API is used, CUDA or OpenCL. As long as you get to use great GPUs”. He ends with a smile, as “great GPUs” refers to NVidia's, of course. 🙂

At 45:10 he lays out NVidia's plans for HPC, before getting back to mobile: “NVidia is going to support both [CUDA and OpenCL] in HPC. In mobile it's going to be all OpenCL”.

At 45:23 he repeats his statements: “In the mobile space I expect OpenCL to be the primary tool“.

Continue reading “Neil Trevett on OpenCL”

OpenCL 1.1 changes compared to 1.0

This blog entry is of interest to you if you don't want to read the whole new specification [PDF] for OpenCL 1.1, but just want an overview of the most important differences with 1.0.

The news-release sums up the changes for 1.1 like this:

  1. New datatypes including 3-component vectors and additional formats
  2. Handling command from multiple hosts and processing buffers across multiple devices
  3. Operations on regions of a buffer including read, write and copy of 1D, 2D and 3D rectangular regions
  4. Enhanced use of events to drive and control command execution
  5. Additional OpenCL C built-in functions such as integer clamp, shuffle and asynchronous strided copies
  6. Improved OpenGL interoperability through efficient sharing of images and buffers by linking OpenCL and OpenGL events.

Furthermore, we can read that the update is completely backwards-compatible with version 1.0. The obvious macros CL_VERSION_1_0 and CL_VERSION_1_1 have been added to handle versioning, but what more is there? This blog post discusses most of the changes, with some subjective opinions added.
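
As a minimal illustration (my sketch, not from the specification), these macros let host code select features at compile time:

#include <CL/cl.h>

/* Select features at compile time, based on the installed headers. */
#ifdef CL_VERSION_1_1
/* Safe to use clCreateSubBuffer, user events, 3-component vectors, ... */
#else
/* Restrict ourselves to the OpenCL 1.0 API. */
#endif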

Additional Formats

3-component vectors

We only had 2-, 4-, 8- or 16-component vectors, but no 3-component ones, which actually was somewhat strange. The functions vload3, vload_half3, vloada_half3 and vstore3, vstore_half3, vstorea_half3 have been added to the family. Watch out: for the half-functions, the offset is calculated somewhat differently compared to the even-sized vectors. In version 1.0 you would have chosen a 4-component vector or a struct when doing a lot of calculations. If you look at the new function vec_step below, it seems a 3-component vector is not more memory-efficient than a 4-component one.
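
As a quick illustration (my sketch, not from the specification), a kernel that scales packed 3-component data could look like this:

__kernel void scale3(__global const float *in, __global float *out, float f)
{
    size_t i = get_global_id(0);
    float3 v = vload3(i, in);    /* reads elements 3*i .. 3*i+2 */
    vstore3(v * f, i, out);      /* writes them back, scaled */
}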

RGB with Padding

We had support for CL_RGB and CL_RGBA (RGB with an alpha channel), and now also CL_RGBx (with a padding channel). The same variants are there for CL_R and CL_RG. Good for graphics programmers, and for easier reading of 32 bpp BMPs.

Cloud Computing / Multi-user Environments

The support for multiple hosts opens possibilities for cloud computing. Side note: cloud computing is another word for multi-user environments, with some promotion for big data centres. All API functions except clSetKernelArg are now thread-safe, but only when kernels are not shared between hosts; see appendix A.2 for more information. The important part is that you think clearly about how to design your software, now that you have to assume others can take your resources too. OpenCL already needed a lot of planning when claiming and releasing resources, so you're probably already mastering that; now just check more often how many resources are available.

Region-specific Operations

Regions make it possible to split a big buffer into parts, without having to keep track of dimensions and offsets during operations run from the host. See clCreateSubBuffer for more information. A real convenience, but watch out when writing to overlapping buffers. The functions clEnqueueCopyBufferRect, clEnqueueReadBufferRect and clEnqueueWriteBufferRect help synchronise commands that copy, read or write a rectangular region.
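
A minimal host-side sketch (my illustration, not from the specification) of carving a region out of a big buffer:

#include <CL/cl.h>

/* Creates a sub-buffer covering `size` bytes at byte offset `origin`.
   Note: origin must respect CL_DEVICE_MEM_BASE_ADDR_ALIGN. */
cl_mem make_region(cl_mem big_buffer, size_t origin, size_t size)
{
    cl_buffer_region region = { origin, size };
    cl_int err;
    cl_mem sub = clCreateSubBuffer(big_buffer, CL_MEM_READ_WRITE,
                                   CL_BUFFER_CREATE_TYPE_REGION,
                                   &region, &err);
    return (err == CL_SUCCESS) ? sub : NULL;
}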

Enhanced Control-room

My favourite description of the host is “the control room”, since you are not over there on the device but in Houston. The more control and information, the better. The new functions are clSetMemObjectDestructorCallback, clCreateUserEvent, clSetUserEventStatus and clSetEventCallback. The first lets you know when resources are freed, so you can keep track of them. User events can be put in the event_wait_list of various functions, just like the built-in events; the function will start when all events are CL_COMPLETE. With clSetEventCallback, immediate actions-on-events can be programmed; combined with user events, the programmer gets some powerful tools. See the example at clSetUserEventStatus for how to use user events.
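
To make this concrete, here is a minimal sketch (my illustration, not from the specification) that gates a kernel on a user event and attaches a completion callback:

#include <stdio.h>
#include <CL/cl.h>

static void CL_CALLBACK on_done(cl_event ev, cl_int status, void *user_data)
{
    (void)ev; (void)user_data;
    if (status == CL_COMPLETE)
        printf("kernel finished\n");
}

/* Assumes context, queue and kernel were set up earlier;
   error checks omitted for brevity. */
void run_gated(cl_context context, cl_command_queue queue, cl_kernel kernel)
{
    cl_event gate = clCreateUserEvent(context, NULL);
    cl_event done;
    size_t global = 1024;

    /* The kernel waits in the queue until the user event completes. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                           1, &gate, &done);
    clSetEventCallback(done, CL_COMPLETE, on_done, NULL);

    clSetUserEventStatus(gate, CL_COMPLETE);   /* open the gate */
    clFlush(queue);

    clReleaseEvent(gate);
    clReleaseEvent(done);
}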

OpenGL events and Direct3D support

The function clCreateEventFromGLsyncKHR links a CL event to a GL sync object, by simply passing the OpenGL sync object. See gl_sharing for more info.

OpenCL now has support for Direct3D 10, which is great! This might also be a good step toward making DirectCompute lighter. See cl_khr_d3d10_sharing for more info. Welcome, DirectX developers! One favour: please be aware that DirectX works on Windows only, not on Apple OSX or iOS, (embedded) Linux or Symbian. If you use clean calls, it will be easier to port to other platforms.

Other New Kernel-functions

The following new functions were added to the kernel:

  • get_global_offset: returns the offset with which the enqueued kernels were started.
  • minmag and maxmag: returns the argument with the minimum or maximum distance to zero, falls back to fmin and fmax if distance is equal or an argument is NaN. Example: maxmag(-5, 3) = -5, minmag(-3, 3) = -3.
  • clamp: returns boundary-values if the given number is not between the boundaries.
  • vec_step: returns the number of elements in a scalar or a vector. A scalar returns 1, a vector 2, 4, 8 or 16. If the size is 3, the function returns 4.
  • shuffle and shuffle2: shuffles one or two vectors given another vector with the indices of the new order. Indeed plain old permutations.
  • async_work_group_strided_copy: copies strided data between global and local memory on the device. When used correctly, this can overcome some of the hassle when you need to work on global memory objects but need more speed. Correct usage is described in the reference.

The functions min and max now also work component-wise, with a vector as the first argument and a scalar as the second: min({2, 4, 6, 8}, 5) gives {2, 4, 5, 5}.
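
A tiny kernel sketch (my own illustration, not from the specification) that tries a few of the new built-ins:

__kernel void demo(__global float4 *out)
{
    float4 v = (float4)(8.0f, -3.0f, 2.5f, 6.0f);
    uint4 order = (uint4)(3, 2, 1, 0);     /* reverse the components */

    float4 c = clamp(v, 0.0f, 5.0f);       /* (5, 0, 2.5, 5) */
    float4 s = shuffle(v, order);          /* (6, 2.5, -3, 8) */
    int n = vec_step(float3);              /* 4: float3 is padded */

    out[0] = c;
    out[1] = s;
    out[2] = (float4)((float)n);
}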

Conclusion

While the many revisions of OpenCL 1.0 were really minor and not much attention was paid to them, 1.1 is a big step forward. If you see what has been done for multi-user environments, NVidia and AMD have a lot of work to do on their drivers.

You can read in revision 33 that there has been some heated discussion and that there was pressure on the decision:

>>Should this extension be KHR or EXT?
PROPOSED: KHR. If this extension is to be approved by Khronos then it should be KHR, otherwise EXT. Not all platforms can support this extension, but that is also true of OpenGL interop.
RESOLVED: KHR.<<

The part “not all platforms” is very politically written down, since exactly one platform supports this specific extension. I have seen too many of these pressured discussions, and I hope Khronos is stronger than e.g. ISO, and that OpenCL will remain as open as OpenGL.

I'm very happy with the new version: there is more control thanks to loads of extra events, multiple hosts are now possible, and the forgotten 3-component vector was added. Now let me know in the comments what you think of the new version.

By the way, not all is new. Deprecated are clSetCommandQueueProperty and the __ROUNDING_MODE__ macro.

Accelerating an Excel Sheet with OpenCL

One of the world's most-used pieces of software is far from performance-optimised, and there is hardly anything we can do about it. I'm talking about Excel.

There are various engine replacements that promise higher speeds, but those have the disadvantage that they're still not fast enough for really heavy calculations. Another option is to use the much faster LibreOffice, but companies prefer ribbons over new software. The last option is to offer performance-optimised modules for the problematic parts. We created a demo a few years ago and revived it recently. Continue reading “Accelerating an Excel Sheet with OpenCL”

Waiting for Mobile OpenCL – Q1 2011

About 5 months ago we started waiting for mobile OpenCL. Meanwhile we had all the news around ARM at CES in January, and of course all those beta programs made progress. And after a year of having “support”, we actually want to see the words “SDK” and/or “driver”. So who's leading? Ziilabs, ImTech, Vivante, Qualcomm, FreeScale or newcomer nVIDIA?

Mobile phone manufacturers could have a big problem with low-level access to the GPU. While most software can be sandboxed in some form, OpenCL can crash the phone. On the other hand, if a program hasn't taken down the developer's test phone, the chances are low it will take down any other phone. And there are more low-level access points to the phone anyway. So let's check what has happened so far.

Note: this article will be updated if more news comes from MWC ’11.

OpenCL EP

For mobile devices, Khronos has specified a profile optimised for (ARM) phones: the OpenCL Embedded Profile. Read on for the main differences (taken from a presentation by Nokia).

Main differences

  • Code needs some adapting for the embedded profile
  • The macro __EMBEDDED_PROFILE__ was added
  • The CL_PLATFORM_PROFILE capability returns the string EMBEDDED_PROFILE if only the embedded profile is supported (see the sketch below this list)
  • The online compiler is optional
  • No 64-bit integers
  • Reduced requirements for constant buffers, object allocation, constant argument count and local memory
  • Image and floating-point support matches OpenGL ES 2.0 texturing
  • The extensions of the full profile can be applied to the embedded profile
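
A minimal host-side sketch (my illustration, not from the Nokia presentation) of detecting the profile at run-time:

#include <string.h>
#include <CL/cl.h>

/* Returns 1 when the platform only supports the embedded profile. */
int is_embedded_profile(cl_platform_id platform)
{
    char profile[64] = {0};
    clGetPlatformInfo(platform, CL_PLATFORM_PROFILE,
                      sizeof(profile), profile, NULL);
    return strcmp(profile, "EMBEDDED_PROFILE") == 0;
}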

Continue reading “Waiting for Mobile OpenCL – Q1 2011”

4 October talk in Amsterdam on mobile compute

On Thursday 4 October I'll give a talk at Hackers&Founders Amsterdam on what mobile compute can do. The goal is to initiate new ideas for start-ups, as not many people know that their mobile phone or tablet is very powerful, and that next year it can be used for compute-intensive tasks.

The other talk is by Mozilla on Firefox OS (edit: it was cancelled), which is actually reason enough to visit this Hackers&Founders meetup. Entrance is free, drinks are not. Alternatively you could go to the Hadoop User Group meetup at Science Park, Amsterdam.

Continue reading “4 October talk in Amsterdam on mobile compute”