We found a finance benchmark for GPUs and wanted to show we could speed its algorithms up. Like a lot!
Following the initial work of porting the CUDA code to HIP (follow article link here), significant progress was made in tackling the low-hanging fruit in the kernels and addressing potential structural problems outside of the kernels.
Additionally, since the last article we’ve been in touch with the authors of the original repository; they’ve even invited us to update their repository too. For now, our work will remain in our own repository only. We also learnt that the group’s lead, Professor John Cavazos, passed away two years ago. We hope he would have appreciated that his work has been revived.
Link to the paper is here: https://dl.acm.org/doi/10.1145/2458523.2458536
Scott Grauer-Gray, William Killian, Robert Searles, and John Cavazos. 2013. Accelerating financial applications on the GPU. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). Association for Computing Machinery, New York, NY, USA, 127–136. DOI:https://doi.org/10.1145/2458523.2458536
Improving the basics
We could have chosen to rewrite the algorithms from scratch, but first we needed to understand them better. Working with the existing GPU code also lets us quickly assess where the algorithms’ problems lie and see whether we can reach high performance without too much effort. In this blog we show these steps.
As a refresher, besides porting the CUDA code to HIP, some restructuring of the code and build system was also done. Such improvements are a standard phase in all projects we do, to make sure we spend the minimum time on building, testing and benchmarking.
- CMake is now used to build the binaries. This allows the developer to choose their own IDE
- Benchmarks and unit tests are now implemented for each algorithm
- Google Benchmark and GoogleTest are used as the benchmarking and unit-testing frameworks; these integrate well with our automated testing and benchmarking environment
- The unit tests compare the OpenMP and HIP implementations against the standard C++ implementation (see the test sketch below)
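As an illustration, below is a minimal sketch of such a comparison test. `blackScholesReference` and `blackScholesHip` are placeholder names for the benchmark’s actual entry points, and the input set is made up for the example.

```cpp
#include <gtest/gtest.h>
#include <vector>

// Placeholders for the benchmark's actual entry points (names are illustrative).
std::vector<double> blackScholesReference(const std::vector<double> &inputs);
std::vector<double> blackScholesHip(const std::vector<double> &inputs);

TEST(BlackScholes, HipMatchesReferenceCpp)
{
    const std::vector<double> inputs(1024, 1.0);   // illustrative input set
    const auto expected = blackScholesReference(inputs);
    const auto actual   = blackScholesHip(inputs);

    ASSERT_EQ(expected.size(), actual.size());
    for (size_t i = 0; i < expected.size(); ++i)
        EXPECT_NEAR(expected[i], actual[i], 1e-9) << "mismatch at option " << i;
}
```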
The original code only measured the compute times. In the new benchmarks, compute times and transfer times are measured separately.
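Below is a rough sketch of how the two timings can be set up with Google Benchmark; `runKernel`, `h_data`, `d_data` and `nBytes` are placeholders rather than the repository’s actual names, and error checking is omitted.

```cpp
#include <benchmark/benchmark.h>
#include <hip/hip_runtime.h>

// Placeholders for the benchmark's actual buffers and kernel launcher.
extern double *h_data, *d_data;
extern size_t nBytes;
void runKernel(double *data, long n);

// Times the kernel only.
static void BM_Compute(benchmark::State &state)
{
    for (auto _ : state) {
        runKernel(d_data, state.range(0));
        hipDeviceSynchronize();              // make sure the kernel has finished
    }
}
BENCHMARK(BM_Compute)->Arg(262144)->Arg(524288)->Arg(1048576);

// Times the kernel plus the host-to-device and device-to-host copies.
static void BM_ComputePlusTransfers(benchmark::State &state)
{
    for (auto _ : state) {
        hipMemcpy(d_data, h_data, nBytes, hipMemcpyHostToDevice);
        runKernel(d_data, state.range(0));
        hipMemcpy(h_data, d_data, nBytes, hipMemcpyDeviceToHost); // synchronises implicitly
    }
}
BENCHMARK(BM_ComputePlusTransfers)->Arg(262144)->Arg(524288)->Arg(1048576);

BENCHMARK_MAIN();
```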
Note: for the new benchmarks we used more recent AMD and Nvidia drivers (ROCm 3.7 and CUDA 10.2).
Monte Carlo
QMC (Sobol) Monte-Carlo method (Equity Option Example)
Below are the results of the original code ported to HIP without any initial optimisations.
| Size | 262144 | | 524288 | | 1048576 | |
|---|---|---|---|---|---|---|
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 215.311 | | 437.140 | | 877.425 | |
| Titan V | 24.712 | 25.29 | 542.859 | 543.930 | 1595.674 | 1597.240 |
| GTX980 | 75.832 | 76.852 | 1754.586 | 1755.851 | 5120.555 | 5127.255 |
| Vega 20 | 11.07 | 12.161 | 19.408 | 22.578 | 36.112 | 40.575 |
| MI25 (Vega 10) | 12.91 | 13.964 | 25.247 | 26.662 | 49.551 | 51.983 |
| s9300 (Fiji) | 42.909 | 42.839 | 85.739 | 89.463 | 169.858 | 174.248 |
The Monte-Carlo algorithm was observed to be compute-bound, which made it easy to identify the low-hanging fruit in the kernel.
- The original implementation initialised the random states in a separate kernel; this initialisation can just as well be done in the compute kernel itself
- Instead of calculating the normal distribution of the random number manually, it’s faster to use the HIP-provided function (which we built for AMD)
- On Nvidia GPUs, the default cuRAND state (XORWOW) is rather slow; switching to Philox improves performance significantly (see the sketch below)
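A minimal sketch of the resulting kernel structure is shown below; the kernel and variable names are illustrative rather than the benchmark’s, and the payoff computation itself is left out.

```cpp
#include <hip/hip_runtime.h>
#include <hiprand/hiprand_kernel.h>

// Sketch: the RNG state is set up inside the compute kernel itself, a Philox
// state is used instead of XORWOW, and the library's normal sampler replaces
// the hand-written transform.
__global__ void monteCarloKernel(double *payoffs, unsigned long long seed, int nPaths)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nPaths) return;

    hiprandStatePhilox4_32_10_t state;
    hiprand_init(seed, tid, 0, &state);             // no separate init kernel needed

    const double z = hiprand_normal_double(&state); // library-provided N(0,1) sample

    payoffs[tid] = z;                               // placeholder for the actual payoff maths
}
```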
A big speed-up can be observed on the Nvidia GPUs, and a considerable speed-up can also be observed on the AMD GPUs.
| Size | 262144 | | 524288 | | 1048576 | |
|---|---|---|---|---|---|---|
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 215.311 | | 437.140 | | 877.425 | |
| Titan V | 9.467 | 9.831 | 17.638 | 18.466 | 34.281 | 35.765 |
| GTX980 | 22.578 | 23.223 | 44.923 | 46.013 | 90.108 | 100.584 |
| Vega 20 | 4.456 | 5.003 | 8.570 | 10.117 | 17.003 | 19.626 |
| MI25 (Vega 10) | 9.118 | 9.663 | 17.995 | 18.918 | 35.760 | 42.621 |
| s9300 (Fiji) | 17.802 | 18.477 | 35.086 | 36.795 | 69.746 | 73.015 |
Black Scholes
Black-Scholes-Merton Process with Analytic European Option engine
Below are the results of the original code ported to HIP without any initial optimisations.
| Size | 262144 | | 524288 | | 1048576 | |
|---|---|---|---|---|---|---|
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 5.005 | | 22.194 | | 43.994 | |
| Titan V | 0.095 | 5.181 | 0.407 | 25.111 | 0.792 | 49.890 |
| GTX980 | 2.214 | 7.294 | 10.662 | 34.534 | 19.946 | 68.626 |
| Vega 20 | 0.201 | 7.105 | 0.894 | 33.693 | 1.596 | 70.732 |
| MI25 (Vega 10) | 1.090 | 7.657 | 3.453 | 41.343 | 6.000 | 74.387 |
| s9300 (Fiji) | 1.048 | 11.524 | 4.635 | 53.246 | 9.170 | 129.262 |
The Black-Scholes algorithm was observed to be memory-bound, so there were both low-hanging fruit and structural problems to tackle.
- Use the CUDA/HIP-provided erf function instead of a custom error function (see the sketch below)
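As a minimal sketch (the function name is ours, not the repository’s), the cumulative normal distribution used in the pricing formula can be built directly on the device-provided erfc variant of the error function:

```cpp
#include <hip/hip_runtime.h>

// Sketch: the cumulative normal distribution expressed with the built-in erfc,
// replacing a hand-rolled approximation of the error function.
// N(x) = 0.5 * erfc(-x / sqrt(2))
__device__ inline double cumulativeNormal(double x)
{
    return 0.5 * erfc(-x * 0.70710678118654752440); // constant is 1/sqrt(2)
}
```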
The first step was to tackle this low-hanging fruit in the kernel. A decent speed-up in the compute times could be observed on most GPUs (except for the Titan V).
| Size | 262144 | | 524288 | | 1048576 | |
|---|---|---|---|---|---|---|
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 5.005 | | 22.194 | | 43.994 | |
| Titan V | 0.084 | 5.092 | 0.367 | 24.549 | 0.722 | 49.956 |
| GTX980 | 1.458 | 6.281 | 6.402 | 30.584 | 12.790 | 61.965 |
| Vega 20 | 0.114 | 6.995 | 0.397 | 33.571 | 0.775 | 69.858 |
| MI25 (Vega 10) | 0.517 | 6.929 | 1.836 | 30.490 | 3.021 | 64.961 |
| s9300 (Fiji) | 0.423 | 10.621 | 1.931 | 48.417 | 3.733 | 81.898 |
With the algorithm being memory-bound, the next step was to tackle the structural problems.
- Given that the original code required an Array-of-Structs to be transferred to the GPU, the next step was to restructure the input data into linear arrays
- This avoids transferring entire structs of which only some fields are used (sketched below)
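A sketch of this kind of restructuring is shown below; the field names are illustrative and the benchmark’s actual option struct and layout differ. Only the arrays the kernel reads are then transferred.

```cpp
#include <hip/hip_runtime.h>

// Original Array-of-Structs layout: copying it to the GPU also copies fields
// the kernel never reads.
struct OptionAoS {
    double spot, strike, rate, volatility, maturity;
    double extra[4];   // unused on the GPU, but transferred anyway
};

// Restructured layout: one linear array per field the kernel actually uses.
struct OptionSoA {
    double *spot, *strike, *rate, *volatility, *maturity;
};

// Transfer only the needed arrays (error checking omitted).
void uploadInputs(const OptionSoA &host, const OptionSoA &device, size_t n)
{
    const size_t bytes = n * sizeof(double);
    hipMemcpy(device.spot,       host.spot,       bytes, hipMemcpyHostToDevice);
    hipMemcpy(device.strike,     host.strike,     bytes, hipMemcpyHostToDevice);
    hipMemcpy(device.rate,       host.rate,       bytes, hipMemcpyHostToDevice);
    hipMemcpy(device.volatility, host.volatility, bytes, hipMemcpyHostToDevice);
    hipMemcpy(device.maturity,   host.maturity,   bytes, hipMemcpyHostToDevice);
}
```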
The results can be found below, where transfer times on all GPUs improved.
| Size | 262144 | | 524288 | | 1048576 | |
|---|---|---|---|---|---|---|
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 5.005 | | 22.194 | | 43.994 | |
| Titan V | 0.068 | 3.937 | 0.286 | 18.258 | 0.565 | 36.479 |
| GTX980 | 1.290 | 5.096 | 6.387 | 24.337 | 12.758 | 48.578 |
| Vega 20 | 0.121 | 5.067 | 0.447 | 25.809 | 0.827 | 47.541 |
| MI25 (Vega 10) | 0.506 | 4.861 | 1.841 | 23.580 | 3.115 | 53.006 |
| s9300 (Fiji) | 0.444 | 7.416 | 2.002 | 36.859 | 3.922 | 64.056 |
Repo
Securities repurchase agreement
Below are the results of the original code ported to HIP without any initial optimisations.
| Size | 262144 | | 524288 | | 1048576 | |
|---|---|---|---|---|---|---|
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 186.928 | | 369.718 | | 732.446 | |
| Titan V | 19.678 | 32.241 | 35.727 | 60.951 | 70.673 | 120.402 |
| GTX980 | 387.308 | 402.682 | 767.263 | 793.159 | 1520.351 | 1578.572 |
| Vega 20 | 14.771 | 37.174 | 28.595 | 69.743 | 56.699 | 131.652 |
| MI25 (Vega 10) | 46.461 | 71.191 | 91.742 | 143.673 | 182.137 | 277.597 |
| s9300 (Fiji) | 77.615 | 107.822 | 153.334 | 217.205 | 306.206 | 418.602 |
The Repo algorithm was observed to be compute-bound, but it also relies on pure double-precision operations. There was no obvious low-hanging fruit in the kernel, and the structure of the data was found to be rather complex (an intertwined mixture of Struct-of-Arrays and Array-of-Structs). Additionally, there are so many transfer calls for the different inputs and outputs that saturating the transfers with multiple non-blocking streams isn’t effective. The current CUDA/HIP implementation also works best on GPUs with good double-precision performance.
Improvements are possible, but they require a larger effort.
Bonds
Fixed-rate bond valuation with flat forward curve
Below are the results of the original code ported to HIP without any initial optimisations.
| Size | 262144 | | 524288 | | 1048576 | |
|---|---|---|---|---|---|---|
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 241.248 | | 482.187 | | 952.058 | |
| Titan V | 31.918 | 49.618 | 61.209 | 97.502 | 123.750 | 195.225 |
| GTX980 | 746.728 | 1117.349 | 1494.679 | 2233.761 | 2976.876 | 4470.009 |
| Vega 20 | 40.112 | 66.460 | 77.123 | 127.623 | 152.657 | 250.067 |
| MI25 (Vega 10) | 141.908 | 215.855 | 278.618 | 425.969 | 553.423 | 844.268 |
| s9300 (Fiji) | 229.011 | 340.212 | 451.539 | 699.059 | 891.284 | 1361.436 |
The Bonds algorithm was observed to be more compute-bound than the Repo algorithm, and it also relies on pure double-precision operations. The same problems as in Repo were observed: no low-hanging fruit could be easily identified, and the structure of the data is complex. That said, unlike the Repo algorithm, there aren’t as many input/output transfers, which makes it practical to use multiple streams.
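Below is a minimal sketch of that approach, assuming two independent input buffers; the buffer names are illustrative and error checking is omitted.

```cpp
#include <hip/hip_runtime.h>

// Sketch: overlap the host-to-device copies of two independent input buffers
// on two non-blocking streams (pinned host memory is needed for true asynchrony).
void transferBondsInputs(double *d_a, const double *h_a, size_t bytesA,
                         double *d_b, const double *h_b, size_t bytesB)
{
    hipStream_t streams[2];
    for (int i = 0; i < 2; ++i)
        hipStreamCreateWithFlags(&streams[i], hipStreamNonBlocking);

    // The two copies are issued on different streams so the runtime may overlap them.
    hipMemcpyAsync(d_a, h_a, bytesA, hipMemcpyHostToDevice, streams[0]);
    hipMemcpyAsync(d_b, h_b, bytesB, hipMemcpyHostToDevice, streams[1]);

    // Wait for both transfers before launching the compute kernel.
    hipStreamSynchronize(streams[0]);
    hipStreamSynchronize(streams[1]);

    for (int i = 0; i < 2; ++i)
        hipStreamDestroy(streams[i]);
}
```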
The results can be found below, where 2 streams are used to transfer all the data.
| Size | 262144 | | 524288 | | 1048576 | |
|---|---|---|---|---|---|---|
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 241.248 | | 482.187 | | 952.058 | |
| Titan V | 31.918 | 45.988 | 61.209 | 89.198 | 123.750 | 178.688 |
| GTX980 | 746.728 | 770.180 | 1494.679 | 1527.538 | 2976.876 | 3043.102 |
| Vega 20 | 40.112 | 59.216 | 77.123 | 113.009 | 152.657 | 216.965 |
| MI25 (Vega 10) | 141.908 | 156.164 | 278.618 | 310.922 | 553.423 | 637.924 |
| s9300 (Fiji) | 229.011 | 256.373 | 451.539 | 493.679 | 891.284 | 981.604 |
Next steps
The changes described above produced good results: speed-ups could be observed across all algorithms except Repo. Combined with the newer AMD and Nvidia drivers, the results are also generally better than those obtained in the previous article.
That said, there is a bug in AMD’s current drivers that makes data transfers slower; we will update this blog with new results once a future driver release fixes it.
What’s next? The next step is to go after the high-hanging fruit in both the CPU and GPU implementations of the algorithms; now that we’ve hit the limits of optimising the current implementations, this is where further performance gains will come from.
Milestones we have planned:
- Get it started + low-hanging fruit in the kernels (Done)
- Looking for structural problems outside the kernels + fixes (Done)
- High-hanging fruit for both CPU and GPU
- OpenCL / SYCL port
- Extending the set of algorithms; any financial company can sponsor such a development
What we can do for you
Finance algorithms on GPUs are often not really optimised for performance, which is quite unexpected. A company that can react in minutes instead of a day is more competitive, and especially in finance this is crucial.
As you have seen, we achieved quite some speed-up with a relatively small investment. When we design code from scratch, we can move in the right direction even more quickly. The difficulty with financial software is that it often needs a holistic approach.
We can help your organization:
- Select the best hardware
- Build libraries that work with existing software, even Excel
- Architect larger scale software with performance in mind
- Improve performance of existing algorithms
Feel free to contact us for inquiries, but also about sponsoring algorithms for the benchmark.