Improving FinanceBench

If you’re into computational finance, you might have heard of FinanceBench.

It’s a benchmark developed at the University of Delaware, aimed at people who work with financial code and want to see which code paths can be targeted at accelerators. It takes the original QuantLib software framework and its samples, and ports four existing quantitative-finance applications. It contains code for Black-Scholes, Monte Carlo, Bonds, and Repo, each of which can be run on the CPU and GPU.

The problem is that it has not been maintained for 5 years, and there were good opportunities for improvement. Even though the paper has long been published, we think the benchmark can still be of good use within computational finance. As we were looking for a way to build a demo for the financial industry that is not behind an NDA, this looked like the perfect starting point. We emailed all the authors of the library, but unfortunately did not get any reply. As the code is provided under a permissive license, we could luckily go forward anyway.

The first version of the code will be released on GitHub early next month. Below we discuss some design choices and preliminary results.

Work done before

Of course the first step was selecting good algorithms: ones that both need a lot of compute and are representative of the finance industry. We could not have done this selection as well as the research team did, which was the main reason to choose this project.

The work done to create the code, as described by the team itself:

The original QuantLib samples were written in C++. QuantLib is a C++ library. Unfortunately, languages like OpenCL, CUDA, and OpenACC cannot directly operate on C++ data structures, and virtual function calls are not possible. Because of this problem, all of the existing code had to be “flattened” to C code. A debugger was used to step through the code paths of each application and see what lines of QuantLib code are executed for each application, and all of the QuantLib code was manually flattened.

This is typical work when porting CPU code to the GPU. Complex C++ code can be very hard to bend in directions it was not designed for. Simplification is then the first step, so the code can be split into manageable parts. As this can be time-consuming, we were happy it was already done, though it often helps to also have the simplified CPU code available for reference.

Our focus differs from that of a research group, and logically this results in code changes. We’re now in phase 1.

Phase 1: Focus on OpenMP and CUDA/HIP

As it’s difficult to focus on multiple languages while improving performance and project quality, we focused on OpenMP and CUDA first. Then we ported the project to HIP and made sure that the translation from CUDA to HIP could be done quickly. This way CPUs and the fastest GPUs can be benchmarked, leaving OpenCL and OpenACC out for now. We have no intention of keeping HMPP in, and have chosen to introduce SYCL to prepare for Intel Xe. We are more interested in benchmarking different types of algorithms on all kinds of hardware than in comparing programming languages.

The project has also been cleaned up, CMake was introduced, and google-benchmark was added, to make it easier for us to work on. We did not look into QuantLib for improvements or read papers on the latest advancements, so the goal was really just to get it started.

We picked a few broadly available AMD and Nvidia GPUs, and chose a dual-socket Xeon (40 cores in total) for the CPU benchmarks. The reported times include transfer times for the GPUs. The original benchmark unfortunately showed compute times only, so we might put some Nvidia Kepler GPUs back in a server to re-benchmark those.

Monte Carlo

QMC (Sobol) Monte-Carlo method (Equity Option Example)

Monte Carlo is often seen when HPC is applied in the finance domain. Part of its appeal is the easy interpretation and straightforward implementation, which makes it easy to explain to HPC developers while showing the performance advantage to quants. It returns a distribution of future asset prices by running thousands to millions of simulations.

The results below are from the code as provided, with a direct port to HIP to include AMD GPUs. As you can see, the Titan V and GTX980 numbers don’t look good.

Size                             262144    524288   1048576
2x Intel Xeon E5-2650 v3 OpenMP 215.311   437.140   877.425
Titan V                          25.162   544.829  1599.428
GTX980                           76.456  1753.816  5120.598
Vega 20                          15.286    30.140    62.110
MI25 (Vega 10)                   13.694    26.733    52.971
s9300 (Fiji)                     25.853    51.403    98.484

Here are the results after fixing the obvious problems and picking the low-hanging fruit. This benefited the Nvidia GPUs a lot, but also the AMD GPUs. There was no low-hanging fruit in the OpenMP code, so no speedup there.

Size                             262144    524288   1048576
2x Intel Xeon E5-2650 v3 OpenMP 215.311   437.140   877.425
Titan V                           9.885    18.501    35.714
GTX980                           23.427    45.810    91.721
Vega 20                          10.864    21.467    42.995
MI25 (Vega 10)                   10.857    21.147    41.214
s9300 (Fiji)                     20.806    41.633    80.465

Black Scholes

Black-Scholes-Merton process with Analytic European Option engine

Black-Scholes is used for estimating the variation over time of financial instruments such as stocks, using the implied volatility of the underlying asset to derive the price of a call option. Again, it is compute-intensive.

The performance of the original code looked good at first sight, but the transfers took 95% of the time. Fixing that is for the next phase.

Size                            1048576   5242880  10485760
2x Intel Xeon E5-2650 v3 OpenMP   5.005    22.194    43.994
Titan V                           7.959    38.852    77.443
GTX980                           10.051    48.038    95.568
Vega 20                           5.907    26.468    53.342
MI25 (Vega 10)                    7.827    36.947    71.499
s9300 (Fiji)                      9.642    37.070    80.118

On some projects it’s better to focus on the largest bottleneck first – for this project we chose to go through the code in a structured way. It is sometimes difficult to explain that improvements will only become visible very late in the project – luckily “experience” is often accepted as the explanation.

So the applied fixes did have a positive effect, but it is hardly noticeable yet.

Size                            1048576   5242880  10485760
2x Intel Xeon E5-2650 v3 OpenMP   5.005    22.194    43.994
Titan V                           7.841    37.957    77.382
GTX980                            9.120    44.473    89.847
Vega 20                           5.870    27.663    51.828
MI25 (Vega 10)                    7.037    31.737    64.898
s9300 (Fiji)                      7.295    32.866    79.558

WIP: Repo

Securities repurchase agreement

This has only been ported to HIP. We did not do any optimisations yet, as we have stability problems with two AMD GPUs to focus on first. With the current code, Vega 20 is faster than the Titan V.

Size                             262144    524288   1048576
2x Intel Xeon E5-2650 v3 OpenMP 186.928   369.718   732.446
Titan V                          38.328    72.444   141.664
GTX980                          404.359   796.748  1599.416
Vega 20                          36.399    67.299   128.133

WIP: Bonds

Fixed-rate bond valuation with flat forward curve

This has also only been ported to HIP. The code for the Bonds benchmark still needs some attention: you can see that the GTX980 is too slow in comparison.

Size                             262144    524288   1048576
2x Intel Xeon E5-2650 v3 OpenMP 241.248   482.187   952.058
Titan V                          46.172    88.865   177.188
GTX980                          761.382  1518.756  3030.603
Vega 20                          63.937   122.484   241.917

Next steps

As you can see, this is really work in progress. Why show it already? Because it lets you see how such a project progresses. Cleaning up the code is done in every project, to avoid delays later on. Adding good tests and benchmarks is another foundational step. Most of the time has gone into these preparations, and only limited time into the improvements themselves.

Milestones we have planned for now:

  1. Get it started + low-hanging fruit in the kernels (WIP)
  2. Looking for structural problems outside the kernels + fixes
  3. High-hanging fruit for both CPU and GPU
  4. OpenCL / SYCL port
  5. Optional: extending algorithms (by request?)

We intend to release a milestone every 6 to 8 weeks. You can get notified by following us on Twitter or LinkedIn.

Feel free to contact us with any questions.
