Improving FinanceBench for GPUs Part II – low hanging fruit

We found a finance benchmark for GPUs and wanted to show we could speed its algorithms up. Like a lot!

Following the initial work of porting the CUDA code to HIP (follow article link here), significant progress has been made in tackling the low-hanging fruit in the kernels and addressing potential structural problems outside of the kernels.

Additionally, since the last article we have been in touch with the authors of the original repository. They have even invited us to update their repository; for now, the work will live in our repository only. We also learnt that the group’s lead, Professor John Cavazos, passed away two years ago. We hope he would have liked to see his work revived.

Link to the paper is here: https://dl.acm.org/doi/10.1145/2458523.2458536

Scott Grauer-Gray, William Killian, Robert Searles, and John Cavazos. 2013. Accelerating financial applications on the GPU. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). Association for Computing Machinery, New York, NY, USA, 127–136. DOI:https://doi.org/10.1145/2458523.2458536

Improving the basics

We could have chosen to rewrite the algorithms from scratch, but first we need to understand the algorithms better. Also, with the existing GPU code we can quickly assess the problems of each algorithm and see whether we can reach high performance without too much effort. In this blog we show these steps.

As a refresher, besides porting the CUDA code to HIP, some restructuring of the code and build system was also done. Such improvements are a standard phase in all projects we do, to make sure we spend the minimum time on building, testing and benchmarking.

  1. CMake is now used to build the binaries. This allows developers to choose their own IDE
  2. Benchmarks and unit tests are now implemented for each algorithm
  3. Google Benchmark and Google Test are used as the benchmarking and unit-test frameworks. These integrate well with our automated testing and benchmarking environment
  4. The unit tests compare the OpenMP and HIP implementations against the standard C++ implementation (a sketch of such a test is shown below)
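
A minimal sketch of such a comparison test is shown below. The helper functions (runMonteCarloCpu, runMonteCarloHip) are hypothetical stand-ins; the real tests call the repository’s own entry points and compare OpenMP and HIP results against the C++ reference.

```cpp
// Sketch of a comparison test; helper functions are hypothetical placeholders
// for the repository's actual entry points.
#include <gtest/gtest.h>
#include <vector>

std::vector<double> runMonteCarloCpu(int numOptions);  // reference C++ path
std::vector<double> runMonteCarloHip(int numOptions);  // HIP path under test

TEST(MonteCarlo, HipMatchesCppReference) {
    const int numOptions = 262144;
    const auto reference = runMonteCarloCpu(numOptions);
    const auto hip       = runMonteCarloHip(numOptions);

    ASSERT_EQ(reference.size(), hip.size());
    for (size_t i = 0; i < reference.size(); ++i)
        EXPECT_NEAR(reference[i], hip[i], 1e-6);  // tolerance for floating-point differences
}
```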

The original code only measured the compute times. In the new benchmarks, compute times and transfer times are measured separately.
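
As a sketch of how this timing split can look with Google Benchmark (the benchmark names and the work inside the loops are placeholders, not the repository’s exact code):

```cpp
// Sketch of separate compute and compute-plus-transfer benchmarks.
// The loop bodies are placeholders; the real benchmarks launch the HIP
// kernels and copy the actual input/output buffers.
#include <benchmark/benchmark.h>

static void BM_Compute(benchmark::State& state) {
    // Setup (not timed): allocate device buffers and upload inputs once.
    for (auto _ : state) {
        // Timed: launch the kernel and wait for completion.
    }
}
BENCHMARK(BM_Compute);

static void BM_ComputePlusTransfers(benchmark::State& state) {
    for (auto _ : state) {
        // Timed: upload inputs, launch the kernel, download results.
    }
}
BENCHMARK(BM_ComputePlusTransfers);

BENCHMARK_MAIN();
```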

Note: for the new benchmarks we used more recent AMD and Nvidia drivers (ROCm 3.7 and CUDA 10.2).

Monte Carlo

QMC (Sobol) Monte-Carlo method (Equity Option Example)

Below are the results of the original code ported to HIP without any initial optimisations.

| Device | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 215.311 | | 437.140 | | 877.425 | |
| Titan V | 24.712 | 25.29 | 542.859 | 543.930 | 1595.674 | 1597.240 |
| GTX980 | 75.832 | 76.852 | 1754.586 | 1755.851 | 5120.555 | 5127.255 |
| Vega 20 | 11.07 | 12.161 | 19.408 | 22.578 | 36.112 | 40.575 |
| MI25 (Vega 10) | 12.91 | 13.964 | 25.247 | 26.662 | 49.551 | 51.983 |
| s9300 (Fiji) | 42.909 | 42.839 | 85.739 | 89.463 | 169.858 | 174.248 |

Benchmark results of the original Monte Carlo code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

The Monte Carlo algorithm was observed to be compute-bound, which made it easy to identify the low-hanging fruit in the kernel:

  • The original implementation initialised the random states in a separate kernel; this initialisation can be done in the compute kernel itself
  • Instead of calculating the normal distribution of the random number manually, it is faster to use the HIP-provided function (which we built for AMD)
  • On Nvidia GPUs, the default cuRAND state (XORWOW) is quite slow; switching to Philox improves performance significantly (a sketch of these kernel changes is shown below)
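
A minimal sketch of the combined kernel is shown below, using the cuRAND device API on Nvidia (hipRAND offers equivalent calls on AMD). The kernel name, parameters and payoff placeholder are illustrative, not the benchmark’s exact code.

```cpp
// Sketch: RNG state initialised inside the compute kernel, using the
// Philox counter-based generator and the library-provided normal sampler.
#include <curand_kernel.h>

__global__ void monteCarloKernel(double* payoffs, int numPaths,
                                 unsigned long long seed)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numPaths) return;

    // Philox state setup is cheap, so it can live in the compute kernel
    // instead of a separate initialisation kernel (unlike XORWOW).
    curandStatePhilox4_32_10_t state;
    curand_init(seed, tid, 0, &state);

    // Library-provided normal sample instead of a hand-rolled transform.
    const double z = curand_normal_double(&state);

    // ... path simulation and payoff computation would follow here ...
    payoffs[tid] = z;  // placeholder
}
```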

A big speed-up can be observed on the Nvidia GPUs, and a considerable speed-up can also be seen on the AMD GPUs.

| Device | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 215.311 | | 437.140 | | 877.425 | |
| Titan V | 9.467 | 9.831 | 17.638 | 18.466 | 34.281 | 35.765 |
| GTX980 | 22.578 | 23.223 | 44.923 | 46.013 | 90.108 | 100.584 |
| Vega 20 | 4.456 | 5.003 | 8.570 | 10.117 | 17.003 | 19.626 |
| MI25 (Vega 10) | 9.118 | 9.663 | 17.995 | 18.918 | 35.760 | 42.621 |
| s9300 (Fiji) | 17.802 | 18.477 | 35.086 | 36.795 | 69.746 | 73.015 |

Benchmark results of the improved Monte Carlo code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

Charts of the Monte Carlo results above (less is better)

Black Scholes

Black-Scholes-Merton Process with Analytic European Option engine

Below are the results of the original code ported to HIP without any initial optimisations.

| Device | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 5.005 | | 22.194 | | 43.994 | |
| Titan V | 0.095 | 5.181 | 0.407 | 25.111 | 0.792 | 49.890 |
| GTX980 | 2.214 | 7.294 | 10.662 | 34.534 | 19.946 | 68.626 |
| Vega 20 | 0.201 | 7.105 | 0.894 | 33.693 | 1.596 | 70.732 |
| MI25 (Vega 10) | 1.090 | 7.657 | 3.453 | 41.343 | 6.000 | 74.387 |
| s9300 (Fiji) | 1.048 | 11.524 | 4.635 | 53.246 | 9.170 | 129.262 |

Benchmark results of the original Black Scholes code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

The Black-Scholes algorithm was observed to be memory-bound, so there was some low-hanging fruit in the kernel as well as structural problems to tackle.

  • Use the CUDA/HIP-provided erf function instead of a custom error function (a sketch is shown below)
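
For illustration, the cumulative normal distribution used in the Black-Scholes formula can be expressed directly with the device-side erf that both CUDA and HIP provide; the function name below is a hypothetical example, not the benchmark’s actual code.

```cpp
// Sketch: cumulative standard normal distribution via the built-in erf
// device function, replacing a hand-written error-function approximation.
__device__ double cumulativeNormal(double x)
{
    // N(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * (1.0 + erf(x * 0.70710678118654752));
}
```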

The first step was to tackle the low-hanging fruit in the kernel. A decent speed-up in compute times could be observed on most GPUs (except the Titan V).

| Device | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 5.005 | | 22.194 | | 43.994 | |
| Titan V | 0.084 | 5.092 | 0.367 | 24.549 | 0.722 | 49.956 |
| GTX980 | 1.458 | 6.281 | 6.402 | 30.584 | 12.790 | 61.965 |
| Vega 20 | 0.114 | 6.995 | 0.397 | 33.571 | 0.775 | 69.858 |
| MI25 (Vega 10) | 0.517 | 6.929 | 1.836 | 30.490 | 3.021 | 64.961 |
| s9300 (Fiji) | 0.423 | 10.621 | 1.931 | 48.417 | 3.733 | 81.898 |

Benchmark results of the improved Black Scholes code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

With the algorithm being memory-bound, the next step was to tackle the structural problems.

  • Given that the original code required an Array of Structs to be transferred to the GPU, the next step was to restructure the input data into linear arrays
  • This prevents transferring an entire struct when not all of its fields are used (a sketch of the restructuring is shown below)
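
A simplified sketch of the idea, with hypothetical field names rather than the benchmark’s actual structs:

```cpp
// Before: array of structs, copied to the GPU wholesale, including fields
// the kernel never reads.
struct OptionData {
    double spot, strike, rate, volatility, maturity;
    // ... further fields that are not used by the kernel ...
};

// After: each required input lives in its own contiguous (linear) array,
// so only the data the kernel actually uses is transferred.
struct OptionArrays {
    double* spot;
    double* strike;
    double* rate;
    double* volatility;
    double* maturity;
};
```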

The results can be found below, where transfer times on all GPUs improved.

| Device | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 5.005 | | 22.194 | | 43.994 | |
| Titan V | 0.068 | 3.937 | 0.286 | 18.258 | 0.565 | 36.479 |
| GTX980 | 1.290 | 5.096 | 6.387 | 24.337 | 12.758 | 48.578 |
| Vega 20 | 0.121 | 5.067 | 0.447 | 25.809 | 0.827 | 47.541 |
| MI25 (Vega 10) | 0.506 | 4.861 | 1.841 | 23.580 | 3.115 | 53.006 |
| s9300 (Fiji) | 0.444 | 7.416 | 2.002 | 36.859 | 3.922 | 64.056 |

Benchmark results of the further improved Black Scholes code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms). Here we changed the data structures.

Charts of the Black Scholes results above (less is better)

Repo

Securities repurchase agreement

Below are the results of the original code ported to HIP without any initial optimisations.

| Device | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 186.928 | | 369.718 | | 732.446 | |
| Titan V | 19.678 | 32.241 | 35.727 | 60.951 | 70.673 | 120.402 |
| GTX980 | 387.308 | 402.682 | 767.263 | 793.159 | 1520.351 | 1578.572 |
| Vega 20 | 14.771 | 37.174 | 28.595 | 69.743 | 56.699 | 131.652 |
| MI25 (Vega 10) | 46.461 | 71.191 | 91.742 | 143.673 | 182.137 | 277.597 |
| s9300 (Fiji) | 77.615 | 107.822 | 153.334 | 217.205 | 306.206 | 418.602 |

Benchmark results of the original Repo code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms). This code did not allow for easy improvements and needs more extensive rewriting.

The Repo algorithm was observed to be compute-bound, but it also relies purely on double-precision operations. There was no obvious low-hanging fruit in the kernel, and the structure of the data was found to be rather complex (a mixture of Struct-of-Arrays and Array-of-Structs that are intertwined). Additionally, there are so many transfer calls for the different inputs and outputs that saturating the transfers with multiple non-blocking streams isn’t effective. As a result, the current CUDA/HIP implementation works best on GPUs with good double-precision performance.

There are improvements possible, but these need a larger effort.

Bonds

Fixed-rate bond valuation with flat forward curve

Below are the results of the original code ported to HIP without any initial optimisations.

| Device | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 241.248 | | 482.187 | | 952.058 | |
| Titan V | 31.918 | 49.618 | 61.209 | 97.502 | 123.750 | 195.225 |
| GTX980 | 746.728 | 1117.349 | 1494.679 | 2233.761 | 2976.876 | 4470.009 |
| Vega 20 | 40.112 | 66.460 | 77.123 | 127.623 | 152.657 | 250.067 |
| MI25 (Vega 10) | 141.908 | 215.855 | 278.618 | 425.969 | 553.423 | 844.268 |
| s9300 (Fiji) | 229.011 | 340.212 | 451.539 | 699.059 | 891.284 | 1361.436 |

Benchmark results of the original Bonds code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

The Bonds algorithm was observed to be even more compute-bound than the Repo algorithm, and it also relies purely on double-precision operations. The same problems as with the Repo algorithm were observed: no low-hanging fruit could easily be identified, and the structure of the data is complex. That said, unlike the Repo algorithm, there are not as many transfers of inputs and outputs, which makes it possible to use multiple streams; a sketch of this approach follows below.
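
A minimal sketch of uploading the Bonds inputs on two streams is shown below. The buffer names and sizes are hypothetical, and true copy overlap also requires pinned host memory (e.g. allocated with hipHostMalloc).

```cpp
// Sketch: host-to-device copies split over two HIP streams so they can
// overlap. Buffer names are illustrative only.
#include <hip/hip_runtime.h>

void uploadBondInputs(const double* h_bondData, double* d_bondData, size_t bondBytes,
                      const double* h_yieldCurve, double* d_yieldCurve, size_t curveBytes)
{
    hipStream_t streams[2];
    for (int i = 0; i < 2; ++i)
        hipStreamCreate(&streams[i]);

    // Each stream carries part of the input; with pinned host buffers the
    // two copies can run concurrently on the copy engines.
    hipMemcpyAsync(d_bondData, h_bondData, bondBytes,
                   hipMemcpyHostToDevice, streams[0]);
    hipMemcpyAsync(d_yieldCurve, h_yieldCurve, curveBytes,
                   hipMemcpyHostToDevice, streams[1]);

    for (int i = 0; i < 2; ++i) {
        hipStreamSynchronize(streams[i]);
        hipStreamDestroy(streams[i]);
    }
}
```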

The results can be found below, where two streams are used to transfer all the data.

| Device | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 241.248 | | 482.187 | | 952.058 | |
| Titan V | 31.918 | 45.988 | 61.209 | 89.198 | 123.750 | 178.688 |
| GTX980 | 746.728 | 770.180 | 1494.679 | 1527.538 | 2976.876 | 3043.102 |
| Vega 20 | 40.112 | 59.216 | 77.123 | 113.009 | 152.657 | 216.965 |
| MI25 (Vega 10) | 141.908 | 156.164 | 278.618 | 310.922 | 553.423 | 637.924 |
| s9300 (Fiji) | 229.011 | 256.373 | 451.539 | 493.679 | 891.284 | 981.604 |

Benchmark results of the improved Bonds code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

Charts of the Bonds results above (less is better)

Next steps

The changes described above produced good results, with speed-ups observed across all the algorithms except Repo. In combination with newer drivers from AMD and Nvidia, general improvements can also be observed compared to the results obtained in the previous article.

That said, there is a bug in AMD’s current drivers that makes data transfers slower; we will update this blog with new results once this is fixed in a future driver release.

What’s next? We will now look for the high-hanging fruit in both the CPU and GPU implementations of the algorithms. This is the way to achieve better performance, as we’ve hit the limit of optimising the current implementations.

Milestones we have planned:

  1. Get it started + low-hanging fruit in the kernels (Done)
  2. Looking for structural problems outside the kernels + fixes (Done)
  3. High-hanging fruit for both CPU and GPU
  4. OpenCL / SYCL port
  5. Extending the algorithms (any financial company can sponsor such a development)

What we can do for you

Finance algorithms on GPUs are often not really optimised for performance, which is quite unexpected. A company that can react in minutes instead of a day is more competitive, and especially in finance this is crucial.

As you have seen, we achieved considerable speed-ups with a relatively small investment. When we design code from scratch, we can head in the right direction more quickly. The difficulty with financial software is that it often needs a holistic approach.

We can help your organization:

  • Select the best hardware
  • Build libraries that work with existing software, even Excel
  • Architect larger scale software with performance in mind
  • Improve performance of existing algorithms

Feel free to contact us for inquiries, but also about sponsoring algorithms for the benchmark.
