We work in the niche of GPGPU computing, where GPUs are programmed to efficiently run large-scale scientific simulations, AI training and inference, and other compute-intensive mathematical software. As recognized experts, we are trusted by customers, mostly from the US and Europe, to speed up their software.

Our projects range from several person-weeks (fixing software performance problems) to several person-years (building extensive high-performance software and libraries).

Join a growing list of companies that trust us with designing and building their core software with performance in mind.

A selection of projects

From newest to oldest (going back to 2014):

  • Speeding up a special-purpose camera on mobile phones [C++, OpenCL, Vulkan]. Increasing the frame rate from stuttering frames to a responsive video stream on a smartphone made it possible to use the camera in new application areas.
  • Speeding up generative AI software on macOS [Objective-C++, Metal]. Using Mac Studios with M1 and M2 chips, we reached the theoretical maximum performance for offline generative AI (making nice pictures).
  • Achieving 1 PFLOPS attention on a single H100 SXM [C++, CUDA]. We built the world’s first implementation of the attention algorithm to exceed 1 PFLOPS on a single GPU.
  • Writing a compiler test suite for a C++ kernel language [OpenCL, C, C++]. For a large vendor we provided an extensive suite of tests to verify that the compiler conforms to the specification. We covered the specification update that followed version 2.1, a big change because of the addition of C++ kernels.
  • Porting GROMACS, OpenMM, AMBER and more to AMD MI100 GPUs [HIP, SYCL, C++, …]. Various supercomputers awarded in 2022 and 2023 use AMD GPUs, so it was crucial to make sure that popular CUDA-optimized software would shine on AMD MI100 GPUs. While optimizing the code for AMD, we also made it run faster on Nvidia GPUs, which means the comparisons between Nvidia and AMD are fair and not influenced by one-sided optimizations. If you run one of these packages on your local supercomputer: you’re welcome. One example: Efficient molecular dynamics simulations on LUMI
  • Building the Khronos OpenCL SDK [OpenCL, C, C++]. It had long been a wish to make OpenCL more than just the language, so we were happy to be awarded this project; the SDK is available on GitHub.
  • Speeding up pyPaSWAS 3 to 5x [C, Python, OpenCL]. We boldly claimed that we could speed up this open-source software for DNA/RNA/protein sequence alignment and trimming, and so we did; the exact speedup depends on the data. Read more on the blog
  • Building multiple libraries for AMD [HIP, C++]. Several foundational libraries on the ROCm GitHub were built by us, and we still maintain them. This project is still active.
    • rocRAND [HIP, C++]. The world’s fastest random number generator (or second fastest, depending on Nvidia’s response) is built for AMD GPUs, and it is open source. With random numbers generated at several hundred gigabytes per second, the library makes it possible to speed up existing code many times over. It is often faster than Nvidia’s cuRAND and is therefore the preferred library on any high-end GPU; a minimal usage sketch follows after this list.
    • rocThrust – AMD’s optimized version of Thrust [HIP, C++]. Highly optimized for CDNA GPUs. Lots of CUDA software is Thrust-based and is now free of vendor lock-in; see the portability sketch after this list.
    • hipCUB – AMD’s optimized version of CUB [HIP, C++]. Highly optimized for CDNA GPUs. Porting CUB-based software to AMD is now a lot simpler. Both rocThrust and hipCUB are built on a shared library, rocPRIM, which unites many of the GPU primitives.
  • Porting a set of ADSL algorithms to an embedded special-purpose GPU [OpenCL, C, C++]. This allows central ADSL routers in large buildings to handle modern ADSL protocols.
  • Optimizing and extending the main image processing framework of a large photo hosting platform [CUDA, C++, AWS]. This project is still active. Here we make sure that nobody notices that the original photos are optimized on-the-fly for the current screen, while we also provide additional filters and features.
  • Flooding simulation [OpenCL, C++, MPI]. Software that simulates the flooding of land, which we ported to multi-GPU OpenCL, achieving a 35x speedup over the MPI version. Read more on the blog
  • Further speeding up CUDA-enabled quantum chemistry software [CUDA, C++]. TeraChem is general-purpose quantum chemistry software designed to run on NVIDIA GPU architectures; our work added an extra 70% performance to its already optimized CUDA code.
  • Porting Manchester University’s UNIFAC to OpenCL on Xeon Phi [OpenCL, C++, MPI]. Even though the Xeon Phi Knights Corner is not a very performant accelerator, we managed to get a 160x speedup starting from single-threaded code. Most of the speedup comes from clever code optimizations rather than low-level tuning. Where OpenMP could get the single-threaded code down to about 8 seconds, we brought it down to 0.062 seconds. Read more on the blog
  • Porting GROMACS from CUDA to OpenCL [CUDA, OpenCL, C, C++]. Until we ported the simulation software at the end of 2014, it had been CUDA-only. The port took several person-months of manually converting all the code. You can now download the source, build it and run it on AMD/Intel hardware. Everything is open source, so you can inspect our code. Read more on the blog. The OpenCL backend has since been deprecated in favor of SYCL.
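
To give an impression of what using rocRAND looks like, here is a minimal sketch of its host API filling a GPU buffer with uniform random numbers. This is our own illustration for this page rather than code from the library’s documentation; the header path, default generator and build line are assumptions and may differ per ROCm release.

    // Minimal rocRAND host-API sketch (illustrative only, not from the library's samples).
    // Build (assumed): hipcc rng_demo.cpp -lrocrand -o rng_demo
    #include <hip/hip_runtime.h>
    #include <rocrand/rocrand.h>   // assumed header path; older ROCm releases ship <rocrand.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t n = 1 << 24;                        // 16M single-precision values
        float* d_data = nullptr;
        hipMalloc(reinterpret_cast<void**>(&d_data), n * sizeof(float));

        rocrand_generator gen;
        rocrand_create_generator(&gen, ROCRAND_RNG_PSEUDO_DEFAULT);
        rocrand_set_seed(gen, 1234ULL);
        rocrand_generate_uniform(gen, d_data, n);        // fill the whole buffer on the GPU
        hipDeviceSynchronize();

        std::vector<float> h_data(4);                    // copy back a few values to inspect
        hipMemcpy(h_data.data(), d_data, 4 * sizeof(float), hipMemcpyDeviceToHost);
        std::printf("first values: %f %f %f %f\n", h_data[0], h_data[1], h_data[2], h_data[3]);

        rocrand_destroy_generator(gen);
        hipFree(d_data);
        return 0;
    }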
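
The “no lock-in” point for rocThrust is easiest to show with code. The sketch below is our own illustration, not taken from either library: because rocThrust keeps the standard thrust:: headers and namespace, the exact same source builds with nvcc against CUDA’s Thrust and with hipcc against rocThrust.

    // Portability sketch (our illustration): the same Thrust source builds for Nvidia and AMD.
    //   Nvidia: nvcc  demo.cu  -o demo
    //   AMD:    hipcc demo.cpp -o demo    (rocThrust provides the same thrust:: headers)
    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>
    #include <cstdio>

    int main() {
        thrust::device_vector<int> v(1 << 20);                      // 1M integers on the GPU
        thrust::sequence(v.begin(), v.end());                       // fill with 0, 1, 2, ...
        thrust::sort(v.begin(), v.end(), thrust::greater<int>());   // descending sort on-device
        long long sum = thrust::reduce(v.begin(), v.end(), 0LL);    // parallel reduction
        std::printf("sum = %lld\n", sum);
        return 0;
    }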

We have helped many more companies become competitive. Some we can only describe vaguely, and some we cannot mention at all. See below for the technologies we work with, as not all of them show up in the list above.

Technologies we work with


Want to know more? Get in contact!

We are acknowledged experts in CUDA, HIP, OpenCL, Vulkan and performance optimization for CPUs and GPUs. We are proud of our portfolio of satisfied customers worldwide, and we can help you build high-performance software too. E-mail us today