Improving FinanceBench for GPUs Part II – low-hanging fruit

We found a finance benchmark for GPUs and wanted to show we could speed its algorithms up. Like a lot!

Following the initial work of porting the CUDA code to HIP (see the previous article), significant progress was made in picking the low-hanging fruit in the kernels and in addressing potential structural problems outside of the kernels.

Additionally, since the last article we have been in touch with the authors of the original repository. They have even invited us to update their repository too; for now, the work will live in our repository only. We also learnt that the group’s lead, Professor John Cavazos, passed away two years ago. We hope he would have liked to see his work revived.

Link to the paper is here: https://dl.acm.org/doi/10.1145/2458523.2458536

Scott Grauer-Gray, William Killian, Robert Searles, and John Cavazos. 2013. Accelerating financial applications on the GPU. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). Association for Computing Machinery, New York, NY, USA, 127–136. DOI:https://doi.org/10.1145/2458523.2458536

Improving the basics

We could have chosen to rewrite the algorithms from scratch, but first we need to understand the algorithms better. Also, with the existing GPU code we can quickly assess where each algorithm’s problems lie, and see whether we can reach high performance without too much effort. In this blog we show these steps.

As a refresher: besides porting the CUDA code to HIP, some restructuring of the code and build system was also done. Such improvements are a standard phase in all our projects, to make sure we spend the minimum amount of time on building, testing and benchmarking.

  1. CMake is now used to build the binaries. This allows the developer to choose their own IDE
  2. Benchmarks and Unit Tests are now implemented for each algorithm
  3. Google Benchmark and Google Test are used as the benchmarking and unit-testing frameworks. These integrate well with our automated testing and benchmarking environment
  4. The unit tests are designed to compare OpenMP and HIP implementations against the standard C++ implementation

The original code only measured the compute times. In the new benchmarks, compute times and transfer times are measured separately.
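
To give an idea of how this separation is set up, below is a minimal Google Benchmark sketch. It is not the repository’s actual benchmark code: the `scale` kernel and the buffer handling are placeholders, but the pattern of timing the kernel alone versus timing kernel plus transfers is the same.

```cpp
// Minimal sketch: separating compute-only from compute+transfer timings.
// Assumes Google Benchmark and the HIP runtime; the kernel is a stand-in
// for the real pricing kernels.
#include <benchmark/benchmark.h>
#include <hip/hip_runtime.h>
#include <vector>

__global__ void scale(double* data, double factor, size_t n) {
  const size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

static void BM_ComputeOnly(benchmark::State& state) {
  const size_t n = state.range(0);
  std::vector<double> host(n, 1.0);
  double* dev = nullptr;
  hipMalloc(&dev, n * sizeof(double));
  hipMemcpy(dev, host.data(), n * sizeof(double), hipMemcpyHostToDevice);
  for (auto _ : state) {
    hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, 0, dev, 1.0001, n);
    hipDeviceSynchronize();  // make sure the kernel has finished before the timer stops
  }
  hipFree(dev);
}
BENCHMARK(BM_ComputeOnly)->Arg(262144)->Arg(524288)->Arg(1048576);

static void BM_ComputePlusTransfers(benchmark::State& state) {
  const size_t n = state.range(0);
  std::vector<double> host(n, 1.0);
  double* dev = nullptr;
  hipMalloc(&dev, n * sizeof(double));
  for (auto _ : state) {
    hipMemcpy(dev, host.data(), n * sizeof(double), hipMemcpyHostToDevice);
    hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, 0, dev, 1.0001, n);
    hipMemcpy(host.data(), dev, n * sizeof(double), hipMemcpyDeviceToHost);  // copying back synchronises
  }
  hipFree(dev);
}
BENCHMARK(BM_ComputePlusTransfers)->Arg(262144)->Arg(524288)->Arg(1048576);

BENCHMARK_MAIN();
```

This gives the kind of “Compute” versus “+ Transfers” split shown in the tables below.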

Note: for the new benchmarks we used more recent AMD and Nvidia drivers (ROCm 3.7 and CUDA 10.2).

Monte Carlo

QMC (Sobol) Monte-Carlo method (Equity Option Example)

Below are the results of the original code ported to HIP without any initial optimisations.

Size | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers
2x Intel Xeon E5-2650 v3 OpenMP | 215.311 | n/a | 437.140 | n/a | 877.425 | n/a
Titan V | 24.712 | 25.295 | 542.859 | 543.930 | 1595.674 | 1597.240
GTX980 | 75.832 | 76.852 | 1754.586 | 1755.851 | 5120.555 | 5127.255
Vega 20 | 11.07 | 12.161 | 19.408 | 22.578 | 36.112 | 40.575
MI25 (Vega 10) | 12.91 | 13.964 | 25.247 | 26.662 | 49.551 | 51.983
s9300 (Fiji) | 42.909 | 42.839 | 85.739 | 89.463 | 169.858 | 174.248
Benchmark-results of the original Monte Carlo code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

The Monte Carlo algorithm was observed to be compute-bound, which made it easy to identify the low-hanging fruit in the kernel (a sketch of the changes follows the list):

  • The original implementation initialised the random states in a separate kernel; this initialisation can be done inside the compute kernel itself
  • Instead of calculating the normal distribution of the random numbers manually, it is faster to use the function that HIP provides (which we built for AMD)
  • On Nvidia GPUs the default cuRAND state (XORWOW) is pretty slow; switching to the Philox state improves performance significantly there
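
Roughly, the changes look as sketched below. This is illustrative code, not the actual FinanceBench kernel; since hipRAND maps to rocRAND on AMD and to cuRAND on Nvidia, the Philox state covers both vendors.

```cpp
// Illustrative sketch of the Monte Carlo changes (not the actual FinanceBench kernel).
#include <hip/hip_runtime.h>
#include <hiprand/hiprand_kernel.h>

__global__ void monteCarloKernel(double* payoffs, size_t numPaths, unsigned long long seed) {
  const size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= numPaths) return;

  // 1. Initialise the RNG state inside the compute kernel: no separate init kernel.
  // 3. Use the counter-based Philox state instead of the default XORWOW state.
  hiprandStatePhilox4_32_10_t state;
  hiprand_init(seed, tid, 0, &state);

  // 2. Draw normally distributed samples with the library-provided function
  //    instead of a hand-written transformation of uniform random numbers.
  const double z = hiprand_normal_double(&state);

  // ... path simulation and payoff calculation would follow here ...
  payoffs[tid] = z;  // placeholder so the sketch is self-contained
}
```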

A big speed-up can be observed on the Nvidia GPUs, and a considerable speed-up can be observed on the AMD GPUs as well.

Size | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers
2x Intel Xeon E5-2650 v3 OpenMP | 215.311 | n/a | 437.140 | n/a | 877.425 | n/a
Titan V | 9.467 | 9.831 | 17.638 | 18.466 | 34.281 | 35.765
GTX980 | 22.578 | 23.223 | 44.923 | 46.013 | 90.108 | 100.584
Vega 20 | 4.456 | 5.003 | 8.570 | 10.117 | 17.003 | 19.626
MI25 (Vega 10) | 9.118 | 9.663 | 17.995 | 18.918 | 35.760 | 42.621
s9300 (Fiji) | 17.802 | 18.477 | 35.086 | 36.795 | 69.746 | 73.015
Benchmark-results of the improved Monte Carlo code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

Black Scholes

Black-Scholes-Merton Process with Analytic European Option engine

Below are the results of the original code ported to HIP without any initial optimisations.

Size | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers
2x Intel Xeon E5-2650 v3 OpenMP | 5.005 | n/a | 22.194 | n/a | 43.994 | n/a
Titan V | 0.095 | 5.181 | 0.407 | 25.111 | 0.792 | 49.890
GTX980 | 2.214 | 7.294 | 10.662 | 34.534 | 19.946 | 68.626
Vega 20 | 0.201 | 7.105 | 0.894 | 33.693 | 1.596 | 70.732
MI25 (Vega 10) | 1.090 | 7.657 | 3.453 | 41.343 | 6.000 | 74.387
s9300 (Fiji) | 1.048 | 11.524 | 4.635 | 53.246 | 9.170 | 129.262
Benchmark-results of the original Black Scholes code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

The Black-Scholes algorithm was observed to be memory-bound, so there was some low-hanging fruit in the kernel as well as structural problems to tackle.

  • Use the erf function provided by CUDA/HIP instead of a custom error-function implementation, as sketched below
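
As an illustration (a sketch, not the repository’s kernel), the cumulative normal distribution that Black-Scholes needs can be written directly on top of the erfc device function:

```cpp
// Illustrative sketch: cumulative normal distribution via the built-in erfc,
// replacing a hand-written polynomial approximation of the error function.
#include <hip/hip_runtime.h>

__device__ inline double cumulativeNormal(double x) {
  // N(x) = 0.5 * erfc(-x / sqrt(2))
  return 0.5 * erfc(-x * 0.7071067811865476);  // 0.7071... = 1/sqrt(2)
}
```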

The first step was to tackle the low-hanging fruit in the kernel. A decent speed-up in the compute times could be observed on most GPUs (except for the Titan V).

Size | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers
2x Intel Xeon E5-2650 v3 OpenMP | 5.005 | n/a | 22.194 | n/a | 43.994 | n/a
Titan V | 0.084 | 5.092 | 0.367 | 24.549 | 0.722 | 49.956
GTX980 | 1.458 | 6.281 | 6.402 | 30.584 | 12.790 | 61.965
Vega 20 | 0.114 | 6.995 | 0.397 | 33.571 | 0.775 | 69.858
MI25 (Vega 10) | 0.517 | 6.929 | 1.836 | 30.490 | 3.021 | 64.961
s9300 (Fiji) | 0.423 | 10.621 | 1.931 | 48.417 | 3.733 | 81.898
Benchmark-results of the improved Black Scholes code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

With the algorithm being memory-bound, the next step was to tackle the structural problems.

  • Given that the original code required an Array-of-Structs to be transferred to the GPU, the next step was to restructure the input data into flat, linear arrays (see the sketch below)
  • This avoids transferring entire structs of which not all fields are actually used by the kernel
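
The sketch below illustrates the idea; the struct and field names are hypothetical, not the repository’s actual types. Instead of copying whole structs, only the fields the kernel reads are packed into flat arrays and transferred.

```cpp
// Illustrative sketch of the Array-of-Structs -> flat arrays change
// (hypothetical types; not the repository's actual data structures).
#include <hip/hip_runtime.h>
#include <vector>

// Before: one big struct per option; the whole struct crosses the bus,
// including fields the kernel never reads.
struct OptionDataAoS {
  double spot, strike, rate, volatility, maturity;
  double bookkeeping[4];  // host-only fields, useless on the GPU
};

// After: one contiguous array per field that the kernel actually needs.
struct OptionDataArrays {
  std::vector<double> spot, strike, rate, volatility, maturity;
};

// Only the useful data is allocated and transferred, and the kernel gets
// nicely coalesced loads as a bonus.
static double* uploadField(const std::vector<double>& host) {
  double* dev = nullptr;
  const size_t bytes = host.size() * sizeof(double);
  hipMalloc(&dev, bytes);
  hipMemcpy(dev, host.data(), bytes, hipMemcpyHostToDevice);
  return dev;
}
```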

The results can be found below, where transfer times on all GPUs improved.

Size | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers
2x Intel Xeon E5-2650 v3 OpenMP | 5.005 | n/a | 22.194 | n/a | 43.994 | n/a
Titan V | 0.068 | 3.937 | 0.286 | 18.258 | 0.565 | 36.479
GTX980 | 1.290 | 5.096 | 6.387 | 24.337 | 12.758 | 48.578
Vega 20 | 0.121 | 5.067 | 0.447 | 25.809 | 0.827 | 47.541
MI25 (Vega 10) | 0.506 | 4.861 | 1.841 | 23.580 | 3.115 | 53.006
s9300 (Fiji) | 0.444 | 7.416 | 2.002 | 36.859 | 3.922 | 64.056
Benchmark-results of the further improved Black Scholes code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms). Here we changed the data structures.

Repo

Securities repurchase agreement

Below are the results of the original code ported to HIP without any initial optimisations.

Size | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers
2x Intel Xeon E5-2650 v3 OpenMP | 186.928 | n/a | 369.718 | n/a | 732.446 | n/a
Titan V | 19.678 | 32.241 | 35.727 | 60.951 | 70.673 | 120.402
GTX980 | 387.308 | 402.682 | 767.263 | 793.159 | 1520.351 | 1578.572
Vega 20 | 14.771 | 37.174 | 28.595 | 69.743 | 56.699 | 131.652
MI25 (Vega 10) | 46.461 | 71.191 | 91.742 | 143.673 | 182.137 | 277.597
s9300 (Fiji) | 77.615 | 107.822 | 153.334 | 217.205 | 306.206 | 418.602
Benchmark-results of the original Repo code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms). This code did not allow for easy improvements and needs more extensive rewriting

The Repo algorithm was observed to be compute-bound, but it also relies purely on double-precision operations. No obvious low-hanging fruit was found in the kernel, and the structure of the data is rather complex (a mixture of intertwined Struct-of-Arrays and Array-of-Structs). Additionally, there are so many transfer calls for the different inputs and outputs that saturating the transfers with multiple non-blocking streams is not effective. Also, in its current state the CUDA/HIP implementation works best on GPUs with good double-precision performance.

There are improvements possible, but these need a larger effort.

Bonds

Fixed-rate bond valuation with flat forward curve

Below are the results of the original code ported to HIP without any initial optimisations.

Size | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers
2x Intel Xeon E5-2650 v3 OpenMP | 241.248 | n/a | 482.187 | n/a | 952.058 | n/a
Titan V | 31.918 | 49.618 | 61.209 | 97.502 | 123.750 | 195.225
GTX980 | 746.728 | 1117.349 | 1494.679 | 2233.761 | 2976.876 | 4470.009
Vega 20 | 40.112 | 66.460 | 77.123 | 127.623 | 152.657 | 250.067
MI25 (Vega 10) | 141.908 | 215.855 | 278.618 | 425.969 | 553.423 | 844.268
s9300 (Fiji) | 229.011 | 340.212 | 451.539 | 699.059 | 891.284 | 1361.436
Benchmark-results of the original Bonds code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

The Bonds algorithm was observed to be even more compute-bound than the Repo algorithm, and it also relies purely on double-precision operations. The same problems as in the Repo algorithm were observed: no low-hanging fruit could easily be identified, and the structure of the data is complex. That said, unlike the Repo algorithm there are not as many transfers of inputs and outputs, which makes it possible to use multiple streams.
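
Below is a rough sketch of such a multi-stream setup; it is illustrative code, not the actual Bonds implementation. The work is split over two streams so that copies on one stream can overlap with the copies and kernel work queued on the other; the host buffers must be pinned (hipHostMalloc) for hipMemcpyAsync to run asynchronously.

```cpp
// Illustrative two-stream sketch (not the actual Bonds code).
#include <hip/hip_runtime.h>

__global__ void priceChunk(const double* in, double* out, size_t n) {
  const size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 1.01;  // placeholder for the real valuation
}

// hostIn/hostOut are assumed to be pinned (allocated with hipHostMalloc),
// otherwise the async copies silently degrade to synchronous ones.
void runWithTwoStreams(const double* hostIn, double* hostOut,
                       double* devIn, double* devOut, size_t n) {
  hipStream_t streams[2];
  for (int s = 0; s < 2; ++s) hipStreamCreate(&streams[s]);

  const size_t half = n / 2;  // assume n is even, for brevity
  for (int s = 0; s < 2; ++s) {
    const size_t offset = s * half;
    hipMemcpyAsync(devIn + offset, hostIn + offset, half * sizeof(double),
                   hipMemcpyHostToDevice, streams[s]);
    hipLaunchKernelGGL(priceChunk, dim3((half + 255) / 256), dim3(256), 0, streams[s],
                       devIn + offset, devOut + offset, half);
    hipMemcpyAsync(hostOut + offset, devOut + offset, half * sizeof(double),
                   hipMemcpyDeviceToHost, streams[s]);
  }
  for (int s = 0; s < 2; ++s) hipStreamSynchronize(streams[s]);
  for (int s = 0; s < 2; ++s) hipStreamDestroy(streams[s]);
}
```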

The results can be found below, where 2 streams are used to transfer all the data.

Size | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers
2x Intel Xeon E5-2650 v3 OpenMP | 241.248 | n/a | 482.187 | n/a | 952.058 | n/a
Titan V | 31.918 | 45.988 | 61.209 | 89.198 | 123.750 | 178.688
GTX980 | 746.728 | 770.180 | 1494.679 | 1527.538 | 2976.876 | 3043.102
Vega 20 | 40.112 | 59.216 | 77.123 | 113.009 | 152.657 | 216.965
MI25 (Vega 10) | 141.908 | 156.164 | 278.618 | 310.922 | 553.423 | 637.924
s9300 (Fiji) | 229.011 | 256.373 | 451.539 | 493.679 | 891.284 | 981.604
Benchmark-results of the improved Bonds code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

Next steps

The changes described above produced good results, with improvements observed across all algorithms except Repo. In combination with newer drivers from AMD and Nvidia, general improvements can also be observed when compared to the results obtained in the previous article.

That said, there is a bug in AMD’s current drivers that makes data transfers slower; we will update this blog with new results once this is fixed in a future driver release.

What’s next? We will look for the high-hanging fruit in both the CPU and GPU implementations of the algorithms. This is the next step towards better performance, as we’ve hit the limit of optimising the current implementations.

Milestones we have planned:

  1. Get it started + low-hanging fruit in the kernels (Done)
  2. Looking for structural problems outside the kernels + fixes (Done)
  3. High-hanging fruit for both CPU and GPU
  4. OpenCL / SYCL port
  5. Extending algorithms. Any financial company can sponsor such a development

What we can do for you

Finance algorithms on GPUs are often not really optimised for performance, which is quite unexpected. A company that can react in minutes instead of a day is more competitive, and especially in finance this is crucial.

As you have seen, we achieved quite some speed-up with a relatively small investment. When we design code from scratch, we can move in the right direction more quickly. The difficulty with financial software is that it often requires a holistic approach.

We can help your organization:

  • Select the best hardware
  • Build libraries that work with existing software, even Excel
  • Architect larger scale software with performance in mind
  • Improve performance of existing algorithms

Feel free to contact us for inquiries, but also about sponsoring algorithms for the benchmark.