Improving FinanceBench

If you’re into computational finance, you might have heard of FinanceBench.

It’s a benchmark developed at the University of Delaware, aimed at people who work with financial code and want to see which code paths can be targeted at accelerators. It takes the original QuantLib software framework and its samples, and ports four existing quantitative-finance applications. It contains code for Black-Scholes, Monte Carlo, Bonds, and Repo, each of which can be run on the CPU and GPU.

The problem is that it has not been maintained for 5 years, and there were good opportunities for improvement. Even though the paper has long been published, we think the benchmark can still be of good use within computational finance. As we were looking for a way to build a demo for the financial industry that is not behind an NDA, this looked like the perfect starting point. We emailed all the authors of the library, but unfortunately did not get any reply. As the code is provided under a permissive license, we could luckily go forward anyway.

The first version of the code will be released on GitHub early next month. Below we discuss some design choices and preliminary results.

Work done before

Of course the first step was selecting good algorithms: ones that both need a lot of compute and are representative of the finance industry. We could not have done this selection as well as the research team did, which was the main reason to choose this project.

The work done to create the code, as described by the team itself:

The original QuantLib samples were written in C++. QuantLib is a C++ library. Unfortunately, languages like OpenCL, CUDA, and OpenACC cannot directly operate on C++ data structures, and virtual function calls are not possible. Because of this problem, all of the existing code had to be “flattened” to C code. A debugger was used to step through the code paths of each application and see what lines of QuantLib code are executed for each application, and all of the QuantLib code was manually flattened.

This is typical work when porting CPU code to the GPU. Complex C++ code can be very hard to bend in directions it was not designed for. Simplification is then the first step, so the code can be split into manageable parts. As this can be time-consuming, we were happy it was already done, though it often helps to also have the simplified CPU code available for reference.

Our focus differs from that of a research group, and logically this results in code changes. We’re now in phase 1.

Phase 1: Focus on OpenMP and CUDA/HIP

As it’s difficult to focus on multiple languages while improving performance and project quality, we focused on OpenMP and CUDA first. Then we ported the project to HIP and made sure that the translation from CUDA to HIP could be done quickly. This way CPUs and the fastest GPUs can be benchmarked, leaving OpenCL and OpenACC out for now. We have no intention of keeping HMPP in, and have chosen to introduce SYCL to prepare for Intel Xe. We are more interested in benchmarking different types of algorithms on all kinds of hardware than in comparing programming languages.

The project has also been cleaned up, CMake was introduced, and google-benchmark was added, to make it easier for us to work on. We did not look into QuantLib for improvements or read papers on the latest advancements, so the goal was really just to get it started.

We picked a few broadly available AMD and Nvidia GPUs, and chose a dual-socket Xeon (40 cores in total) for the CPU benchmarks. The reported times include transfer times for the GPUs. The original benchmark unfortunately showed compute times only, so we might put some Nvidia Kepler GPUs back in a server to re-benchmark those.

Monte Carlo

QMC (Sobol) Monte-Carlo method (Equity Option Example)

Monte Carlo is often seen when HPC is applied in the finance domain. Part of its appeal is the easy interpretation and straightforward implementation, which makes it easy to explain to HPC developers while showing the performance advantage to quants. It returns a distribution of future asset prices by running thousands to millions of simulations.

The results below are from the code as provided, with a direct port to HIP to include AMD GPUs. As you can see, the Titan V and GTX980 numbers don’t look good.

Size                             262144    524288   1048576
2x Intel Xeon E5-2650 v3 OpenMP 215.311   437.140   877.425
Titan V                          25.162   544.829  1599.428
GTX980                           76.456  1753.816  5120.598
Vega 20                          15.286    30.140    62.110
MI25 (Vega 10)                   13.694    26.733    52.971
s9300 (Fiji)                     25.853    51.403    98.484

Here are the results after fixing the obvious problems and picking the low-hanging fruit. This benefited the Nvidia GPUs a lot, but also the AMD GPUs. There was no low-hanging fruit in the OpenMP code, so no speedup there.

Size                             262144    524288   1048576
2x Intel Xeon E5-2650 v3 OpenMP 215.311   437.140   877.425
Titan V                           9.885    18.501    35.714
GTX980                           23.427    45.810    91.721
Vega 20                          10.864    21.467    42.995
MI25 (Vega 10)                   10.857    21.147    41.214
s9300 (Fiji)                     20.806    41.633    80.465

Black Scholes

Black-Scholes-Merton process with Analytic European Option engine

Black-Scholes is used for estimating the variation over time of financial instruments such as stocks, using the implied volatility of the underlying asset to derive the price of a call option. Again, it is compute-intensive.

The performance of the original code looked good at first sight, but the transfers took 95% of the time. Fixing that is for the next phase.

Size                            1048576   5242880  10485760
2x Intel Xeon E5-2650 v3 OpenMP   5.005    22.194    43.994
Titan V                           7.959    38.852    77.443
GTX980                           10.051    48.038    95.568
Vega 20                           5.907    26.468    53.342
MI25 (Vega 10)                    7.827    36.947    71.499
s9300 (Fiji)                      9.642    37.070    80.118

On some projects it’s better to focus on the largest bottleneck first – for this project we chose to go through the code in a structured way. It is sometimes difficult to explain that improvements will only become visible very late in the project – luckily “experience” is often accepted as the explanation.

So the applied fixes did have a positive effect, but it is hardly noticeable yet.

Size                            1048576   5242880  10485760
2x Intel Xeon E5-2650 v3 OpenMP   5.005    22.194    43.994
Titan V                           7.841    37.957    77.382
GTX980                            9.120    44.473    89.847
Vega 20                           5.870    27.663    51.828
MI25 (Vega 10)                    7.037    31.737    64.898
s9300 (Fiji)                      7.295    32.866    79.558

WIP: Repo

Securities repurchase agreement

This has only been ported to HIP. We did not do any optimisations yet, as we have stability problems with two AMD GPUs to focus on first. With the current code, Vega 20 is faster than the Titan V.

Size                             262144    524288   1048576
2x Intel Xeon E5-2650 v3 OpenMP 186.928   369.718   732.446
Titan V                          38.328    72.444   141.664
GTX980                          404.359   796.748  1599.416
Vega 20                          36.399    67.299   128.133

WIP: Bonds

Fixed-rate bond valuation with flat forward curve

This has also only been ported to HIP. The code for the Bonds benchmark still needs some attention: you can see that the GTX980 is too slow in comparison.

Size                             262144    524288   1048576
2x Intel Xeon E5-2650 v3 OpenMP 241.248   482.187   952.058
Titan V                          46.172    88.865   177.188
GTX980                          761.382  1518.756  3030.603
Vega 20                          63.937   122.484   241.917

Next steps

As you can see, this is really work in progress. Why show it already? Because it lets you see how such a project progresses. Cleaning up the code is done in every project, to avoid delays later on. Adding good tests and benchmarks is another foundational step. Most of the time has gone into these preparations, and only limited time into the improvements themselves.

Milestones we have planned for now:

  1. Get it started + low-hanging fruit in the kernels (WIP)
  2. Looking for structural problems outside the kernels + fixes
  3. High-hanging fruit for both CPU and GPU
  4. OpenCL / SYCL port
  5. Optional: extending algorithms (by request?)

We intend to release a milestone every 6 to 8 weeks. You can get notified by following us on Twitter or LinkedIn.

Feel free to contact us with any questions.
