Improving FinanceBench for GPUs Part II – low hanging fruit

We found a finance benchmark for GPUs and wanted to show we could speed its algorithms up. Like a lot!

Following the initial work of porting the CUDA code to HIP (follow article link here), significant progress has been made in tackling the low-hanging fruit in the kernels and addressing potential structural problems outside of the kernels.

Additionally, since the last article we have been in touch with the authors of the original repository. They have even invited us to update their repository; for now, the work will live in our repository only. We also learnt that the group’s lead, Professor John Cavazos, passed away two years ago. We hope he would have liked to see his work revived.

Link to the paper is here: https://dl.acm.org/doi/10.1145/2458523.2458536

Scott Grauer-Gray, William Killian, Robert Searles, and John Cavazos. 2013. Accelerating financial applications on the GPU. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). Association for Computing Machinery, New York, NY, USA, 127–136. DOI:https://doi.org/10.1145/2458523.2458536

Improving the basics

We could have chosen to rewrite the algorithms from scratch, but first we need to understand the algorithms better. Also, with the existing GPU code we can quickly assess the problems of each algorithm and see whether we can reach high performance without too much effort. In this blog we show these steps.

As a refresher, besides porting the CUDA code to HIP, some restructuring of the code and build system was also done. Such improvements are a standard phase in all projects we do, to make sure we spend the minimum time on building, testing and benchmarking.

  1. CMake is now used to build the binaries. This allows developers to choose their own IDE
  2. Benchmarks and unit tests are now implemented for each algorithm
  3. Google Benchmark and Google Test are used as the benchmarking and unit-test frameworks. These integrate well with our automated testing and benchmarking environment
  4. The unit tests compare the OpenMP and HIP implementations against the standard C++ implementation (a sketch of such a test is shown below)
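
A minimal sketch of such a comparison test is shown below. The helper functions (runMonteCarloCpu, runMonteCarloHip) are hypothetical stand-ins; the real tests call the repository’s own entry points and compare OpenMP and HIP results against the C++ reference.

```cpp
// Sketch of a comparison test; helper functions are hypothetical placeholders
// for the repository's actual entry points.
#include <gtest/gtest.h>
#include <vector>

std::vector<double> runMonteCarloCpu(int numOptions);  // reference C++ path
std::vector<double> runMonteCarloHip(int numOptions);  // HIP path under test

TEST(MonteCarlo, HipMatchesCppReference) {
    const int numOptions = 262144;
    const auto reference = runMonteCarloCpu(numOptions);
    const auto hip       = runMonteCarloHip(numOptions);

    ASSERT_EQ(reference.size(), hip.size());
    for (size_t i = 0; i < reference.size(); ++i)
        EXPECT_NEAR(reference[i], hip[i], 1e-6);  // tolerance for floating-point differences
}
```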

The original code only measured the compute times. In the new benchmarks, compute times and transfer times are measured separately.
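
As a sketch of how this timing split can look with Google Benchmark (the benchmark names and the work inside the loops are placeholders, not the repository’s exact code):

```cpp
// Sketch of separate compute and compute-plus-transfer benchmarks.
// The loop bodies are placeholders; the real benchmarks launch the HIP
// kernels and copy the actual input/output buffers.
#include <benchmark/benchmark.h>

static void BM_Compute(benchmark::State& state) {
    // Setup (not timed): allocate device buffers and upload inputs once.
    for (auto _ : state) {
        // Timed: launch the kernel and wait for completion.
    }
}
BENCHMARK(BM_Compute);

static void BM_ComputePlusTransfers(benchmark::State& state) {
    for (auto _ : state) {
        // Timed: upload inputs, launch the kernel, download results.
    }
}
BENCHMARK(BM_ComputePlusTransfers);

BENCHMARK_MAIN();
```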

Note: for the new benchmarks we used more recent AMD and Nvidia drivers (ROCm 3.7 and CUDA 10.2).

Monte Carlo

QMC (Sobol) Monte-Carlo method (Equity Option Example)

Below are the results of the original code ported to HIP without any initial optimisations.

| Device | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 215.311 | | 437.140 | | 877.425 | |
| Titan V | 24.712 | 25.29 | 542.859 | 543.930 | 1595.674 | 1597.240 |
| GTX980 | 75.832 | 76.852 | 1754.586 | 1755.851 | 5120.555 | 5127.255 |
| Vega 20 | 11.07 | 12.161 | 19.408 | 22.578 | 36.112 | 40.575 |
| MI25 (Vega 10) | 12.91 | 13.964 | 25.247 | 26.662 | 49.551 | 51.983 |
| s9300 (Fiji) | 42.909 | 42.839 | 85.739 | 89.463 | 169.858 | 174.248 |

Benchmark results of the original Monte Carlo code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

The Monte Carlo algorithm was observed to be compute-bound, which made it easy to identify the low-hanging fruit in the kernel:

  • The original implementation initialised the random states in a separate kernel; this initialisation can be done in the compute kernel itself
  • Instead of calculating the normal distribution of the random number manually, it is faster to use the HIP-provided function (which we built for AMD)
  • On Nvidia GPUs, the default cuRAND state (XORWOW) is quite slow; switching to Philox improves performance significantly (a sketch of these kernel changes is shown below)
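
A minimal sketch of the combined kernel is shown below, using the cuRAND device API on Nvidia (hipRAND offers equivalent calls on AMD). The kernel name, parameters and payoff placeholder are illustrative, not the benchmark’s exact code.

```cpp
// Sketch: RNG state initialised inside the compute kernel, using the
// Philox counter-based generator and the library-provided normal sampler.
#include <curand_kernel.h>

__global__ void monteCarloKernel(double* payoffs, int numPaths,
                                 unsigned long long seed)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numPaths) return;

    // Philox state setup is cheap, so it can live in the compute kernel
    // instead of a separate initialisation kernel (unlike XORWOW).
    curandStatePhilox4_32_10_t state;
    curand_init(seed, tid, 0, &state);

    // Library-provided normal sample instead of a hand-rolled transform.
    const double z = curand_normal_double(&state);

    // ... path simulation and payoff computation would follow here ...
    payoffs[tid] = z;  // placeholder
}
```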

A big speed-up can be observed on the Nvidia GPUs, and a considerable speed-up can also be seen on the AMD GPUs.

| Device | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 215.311 | | 437.140 | | 877.425 | |
| Titan V | 9.467 | 9.831 | 17.638 | 18.466 | 34.281 | 35.765 |
| GTX980 | 22.578 | 23.223 | 44.923 | 46.013 | 90.108 | 100.584 |
| Vega 20 | 4.456 | 5.003 | 8.570 | 10.117 | 17.003 | 19.626 |
| MI25 (Vega 10) | 9.118 | 9.663 | 17.995 | 18.918 | 35.760 | 42.621 |
| s9300 (Fiji) | 17.802 | 18.477 | 35.086 | 36.795 | 69.746 | 73.015 |

Benchmark results of the improved Monte Carlo code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

Charts of the Monte Carlo results above (less is better)

Black Scholes

Black-Scholes-Merton Process with Analytic European Option engine

Below are the results of the original code ported to HIP without any initial optimisations.

| Device | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 5.005 | | 22.194 | | 43.994 | |
| Titan V | 0.095 | 5.181 | 0.407 | 25.111 | 0.792 | 49.890 |
| GTX980 | 2.214 | 7.294 | 10.662 | 34.534 | 19.946 | 68.626 |
| Vega 20 | 0.201 | 7.105 | 0.894 | 33.693 | 1.596 | 70.732 |
| MI25 (Vega 10) | 1.090 | 7.657 | 3.453 | 41.343 | 6.000 | 74.387 |
| s9300 (Fiji) | 1.048 | 11.524 | 4.635 | 53.246 | 9.170 | 129.262 |

Benchmark results of the original Black Scholes code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

The Black-Scholes algorithm was observed to be memory-bound, so there was some low-hanging fruit in the kernel as well as structural problems to tackle.

  • Use the CUDA/HIP-provided erf function instead of a custom error function (a sketch is shown below)
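
For illustration, the cumulative normal distribution used in the Black-Scholes formula can be expressed directly with the device-side erf that both CUDA and HIP provide; the function name below is a hypothetical example, not the benchmark’s actual code.

```cpp
// Sketch: cumulative standard normal distribution via the built-in erf
// device function, replacing a hand-written error-function approximation.
__device__ double cumulativeNormal(double x)
{
    // N(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * (1.0 + erf(x * 0.70710678118654752));
}
```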

The first step was to tackle the low-hanging fruit in the kernel. A decent speed-up in compute times could be observed on most GPUs (except the Titan V).

| Device | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 5.005 | | 22.194 | | 43.994 | |
| Titan V | 0.084 | 5.092 | 0.367 | 24.549 | 0.722 | 49.956 |
| GTX980 | 1.458 | 6.281 | 6.402 | 30.584 | 12.790 | 61.965 |
| Vega 20 | 0.114 | 6.995 | 0.397 | 33.571 | 0.775 | 69.858 |
| MI25 (Vega 10) | 0.517 | 6.929 | 1.836 | 30.490 | 3.021 | 64.961 |
| s9300 (Fiji) | 0.423 | 10.621 | 1.931 | 48.417 | 3.733 | 81.898 |

Benchmark results of the improved Black Scholes code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

With the algorithm being memory-bound, the next step was to tackle the structural problems.

  • Given that the original code required an Array of Structs to be transferred to the GPU, the next step was to restructure the input data into linear arrays
  • This prevents transferring an entire struct when not all of its fields are used (a sketch of the restructuring is shown below)
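
A simplified sketch of the idea, with hypothetical field names rather than the benchmark’s actual structs:

```cpp
// Before: array of structs, copied to the GPU wholesale, including fields
// the kernel never reads.
struct OptionData {
    double spot, strike, rate, volatility, maturity;
    // ... further fields that are not used by the kernel ...
};

// After: each required input lives in its own contiguous (linear) array,
// so only the data the kernel actually uses is transferred.
struct OptionArrays {
    double* spot;
    double* strike;
    double* rate;
    double* volatility;
    double* maturity;
};
```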

The results can be found below, where transfer times on all GPUs improved.

| Device | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 5.005 | | 22.194 | | 43.994 | |
| Titan V | 0.068 | 3.937 | 0.286 | 18.258 | 0.565 | 36.479 |
| GTX980 | 1.290 | 5.096 | 6.387 | 24.337 | 12.758 | 48.578 |
| Vega 20 | 0.121 | 5.067 | 0.447 | 25.809 | 0.827 | 47.541 |
| MI25 (Vega 10) | 0.506 | 4.861 | 1.841 | 23.580 | 3.115 | 53.006 |
| s9300 (Fiji) | 0.444 | 7.416 | 2.002 | 36.859 | 3.922 | 64.056 |

Benchmark results of the further improved Black Scholes code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms). Here we changed the data structures.

Charts of the Black Scholes results above (less is better)

Repo

Securities repurchase agreement

Below are the results of the original code ported to HIP without any initial optimisations.

| Device | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 186.928 | | 369.718 | | 732.446 | |
| Titan V | 19.678 | 32.241 | 35.727 | 60.951 | 70.673 | 120.402 |
| GTX980 | 387.308 | 402.682 | 767.263 | 793.159 | 1520.351 | 1578.572 |
| Vega 20 | 14.771 | 37.174 | 28.595 | 69.743 | 56.699 | 131.652 |
| MI25 (Vega 10) | 46.461 | 71.191 | 91.742 | 143.673 | 182.137 | 277.597 |
| s9300 (Fiji) | 77.615 | 107.822 | 153.334 | 217.205 | 306.206 | 418.602 |

Benchmark results of the original Repo code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms). This code did not allow for easy improvements and needs more extensive rewriting.

The Repo algorithm was observed to be compute-bound, but it also relies purely on double-precision operations. There was no obvious low-hanging fruit in the kernel, and the structure of the data was found to be rather complex (a mixture of Struct-of-Arrays and Array-of-Structs that are intertwined). Additionally, there are so many transfer calls for the different inputs and outputs that saturating the transfers with multiple non-blocking streams isn’t effective. As a result, the current CUDA/HIP implementation works best on GPUs with good double-precision performance.

There are improvements possible, but these need a larger effort.

Bonds

Fixed-rate bond valuation with flat forward curve

Below are the results of the original code ported to HIP without any initial optimisations.

| Device | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 241.248 | | 482.187 | | 952.058 | |
| Titan V | 31.918 | 49.618 | 61.209 | 97.502 | 123.750 | 195.225 |
| GTX980 | 746.728 | 1117.349 | 1494.679 | 2233.761 | 2976.876 | 4470.009 |
| Vega 20 | 40.112 | 66.460 | 77.123 | 127.623 | 152.657 | 250.067 |
| MI25 (Vega 10) | 141.908 | 215.855 | 278.618 | 425.969 | 553.423 | 844.268 |
| s9300 (Fiji) | 229.011 | 340.212 | 451.539 | 699.059 | 891.284 | 1361.436 |

Benchmark results of the original Bonds code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

The Bonds algorithm was observed to be even more compute-bound than the Repo algorithm, and it also relies purely on double-precision operations. The same problems as with the Repo algorithm were observed: no low-hanging fruit could easily be identified, and the structure of the data is complex. That said, unlike the Repo algorithm, there are not as many transfers of inputs and outputs, which makes it possible to use multiple streams; a sketch of this approach follows below.
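
A minimal sketch of uploading the Bonds inputs on two streams is shown below. The buffer names and sizes are hypothetical, and true copy overlap also requires pinned host memory (e.g. allocated with hipHostMalloc).

```cpp
// Sketch: host-to-device copies split over two HIP streams so they can
// overlap. Buffer names are illustrative only.
#include <hip/hip_runtime.h>

void uploadBondInputs(const double* h_bondData, double* d_bondData, size_t bondBytes,
                      const double* h_yieldCurve, double* d_yieldCurve, size_t curveBytes)
{
    hipStream_t streams[2];
    for (int i = 0; i < 2; ++i)
        hipStreamCreate(&streams[i]);

    // Each stream carries part of the input; with pinned host buffers the
    // two copies can run concurrently on the copy engines.
    hipMemcpyAsync(d_bondData, h_bondData, bondBytes,
                   hipMemcpyHostToDevice, streams[0]);
    hipMemcpyAsync(d_yieldCurve, h_yieldCurve, curveBytes,
                   hipMemcpyHostToDevice, streams[1]);

    for (int i = 0; i < 2; ++i) {
        hipStreamSynchronize(streams[i]);
        hipStreamDestroy(streams[i]);
    }
}
```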

The results can be found below, where two streams are used to transfer all the data.

| Device | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 241.248 | | 482.187 | | 952.058 | |
| Titan V | 31.918 | 45.988 | 61.209 | 89.198 | 123.750 | 178.688 |
| GTX980 | 746.728 | 770.180 | 1494.679 | 1527.538 | 2976.876 | 3043.102 |
| Vega 20 | 40.112 | 59.216 | 77.123 | 113.009 | 152.657 | 216.965 |
| MI25 (Vega 10) | 141.908 | 156.164 | 278.618 | 310.922 | 553.423 | 637.924 |
| s9300 (Fiji) | 229.011 | 256.373 | 451.539 | 493.679 | 891.284 | 981.604 |

Benchmark results of the improved Bonds code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms)

Charts of the Bonds results above (less is better)

Next steps

The changes described above produced good results, with speed-ups observed across all the algorithms except Repo. In combination with newer drivers from AMD and Nvidia, general improvements can also be observed compared to the results obtained in the previous article.

That said, there is a bug in AMD’s current drivers that makes data transfers slower; we will update this blog with new results once this is fixed in a future driver release.

What’s next? We will now look for the high-hanging fruit in both the CPU and GPU implementations of the algorithms. This is the way to achieve better performance, as we’ve hit the limit of optimising the current implementations.

Milestones we have planned:

  1. Get it started + low-hanging fruit in the kernels (Done)
  2. Looking for structural problems outside the kernels + fixes (Done)
  3. High-hanging fruit for both CPU and GPU
  4. OpenCL / SYCL port
  5. Extending the algorithms (any financial company can sponsor such a development)

What we can do for you

Finance algorithms on GPUs are often not really optimised for performance, which is quite unexpected. A company that can react in minutes instead of a day is more competitive, and especially in finance this is crucial.

As you have seen, we achieved considerable speed-ups with a relatively small investment. When we design code from scratch, we can head in the right direction more quickly. The difficulty with financial software is that it often needs a holistic approach.

We can help your organization:

  • Select the best hardware
  • Build libraries that work with existing software, even Excel
  • Architect larger scale software with performance in mind
  • Improve performance of existing algorithms

Feel free to contact us for inquiries, but also about sponsoring algorithms for the benchmark.
