In the first part of our two-part blog series, we have discussed how parallel computing applications can best use pseudo-random number generators (PRNGs) so as to benefit from parallel computing speedups, without negatively impacting the statistical properties of the random numbers generated. We have argued that index-based PRNGs (e.g., from the Random123 library), which do not maintain any state but instead take an index and a key as input and return the random number corresponding to the index in its random output sequence, provide numerous benefits in parallel programming. This week, we discuss both an application of index-based PRNGs and another important aspect of PRNGs in parallel environments: that of reproducing the same output among different parallel implementations or among a parallel and a serial implementation. We focus on the latter and consider reproducibility for the purpose of verification, which – as we have stated last week – is an important matter for our customers.
Reproducibility
Reproducibility may be simple if random numbers are only used in the initialization phase of the application. In this case, it may be sensible to record the set of random numbers generated in the serial code, write it to a file and use this file as input to a table-based approach for random-number generation in the parallel code (as discussed in Part 1 of our blog). Indeed, this is a convenient approach for both us and our customers as we do not have to deal with PRNG implementation internals and our customer can independently verify the correctness of the parallel implementation. We have used this for several of our projects.
More complex scenarios such as stochastic simulations and Monte Carlo applications require a continuous, parallel generation of random numbers. In this case, reproducibility at first seems challenging considering the concurrent and thus unpredictable order in which parallel code may invoke PRNGs. Indeed, it may be impossible to achieve when using traditional, seed-based PRNGs. This is because we may need to access entries in the PRNG’s output sequence in an arbitrary order rather than sequentially. Essentially, we need to define a one-to-one mapping between random numbers generated in the serial code and those used in the parallel version and then be able to ask the PRNG for the first entry, the second entry, and so on, in both implementations. The random-access capability is exactly what index-based PRNGs provide. The main challenge lies in the definition of the mapping.
We may consider two options for defining the one-to-one mapping between random numbers generated in a serial code and those created in a parallel implementation. The summary of our findings is this:
- Serial-to-parallel mapping: This amounts to an emulation of the serial random number generation in the parallel code. It generally requires minimal or no changes in the serial, original code, but may be increasingly difficult or impossible to define as the complexity of the code grows.
- Parallel-to-serial mapping: This emulates the parallel random number generation in the serial code. It is usually easier to define than the serial-to-parallel mapping but requires deeper changes in the serial code.
We explain below what we mean by that.
Consider a simple scenario where we have some serial code that generates a new set of random numbers in each iteration of a loop using a traditional seed-based PRNG. A parallel implementation may operate by executing each iteration via a single work item. Using traditional PRNGs, we may equip each work item with its own PRNG seed.
- Using a serial-to-parallel mapping, we may then record the PRNG state at each iteration of the loop in the serial code and initialize the PRNG state in each work item of the parallel code similarly to a table-based approach – this does not require any changes in the serial code (other than temporarily for recording the random numbers). Alternatively, we may replace the PRNG in the serial code with an index-based one and use a single, static counter for the PRNG, which we increment with each invocation of the PRNG – this is a minimal change in the serial code. We reflect the index-based PRNG in the parallel code and initialize each work item’s counter with w·n, where w is the work item index and n denotes the size of the set of random numbers created per loop. In both cases, the serial and the parallel code should produce identical outputs.
- For a parallel-to-serial mapping, it is easier to assume that the parallel implementation uses index-based PRNGs. Let’s assume that each work item uses indices of the form w·n+j for the j-th item in the set of random numbers generated. Then the serial code is adapted to use the same PRNG and computes indices as WI(i)·n+j, where WI(i) returns the work item index corresponding to the i-th iteration of the loop. Again, the serial and parallel implementations should then produce identical outputs.
A proper mapping becomes increasingly difficult to define when the code complexity grows. However, it is usually easier to map the parallel random number generation to the serial code. Indeed, consider the case where some random numbers are generated in a data-dependent manner, i.e., only if some condition on the input data is fulfilled. Then it is impossible to give a predefined mapping from the serial to the parallel code. We therefore prefer the serial-to-parallel mapping whenever it is easy to define (as we don’t risk or minimize the risk of introducing bugs in the original code by changing the PRNG generation) but resort to the parallel-to-serial mapping for more difficult cases.
In the past year we’ve been working on more internal projects and therefore we’re seeking strong GPU-coders (good OpenCL experience required) worldwide. This way you can combine staying close to your family and working with advanced technologies. You will be on the newly formed international team.



For years we haven been complaining on this blog what AMD was lacking and what needed to be improved. And as you might have concluded from the title of this blogpost, there has been a lot of progress.
Are you around at ISC and have an opinion on portable open standards? Then you should join the discussion with other professionals at ISC. Some suggestions for discussions:

We have been talking about GPUs, FPGAs and CPUs a lot, but there are more processors that can solve specific problems. This time I’d like you to give a quick introduction to grid-processors.
OpenCL header files
Want to get an overview of what Heterogeneous Systems Architecture (HSA) does, or want to know what terminology has changed since version 1.0? Read further.
10 years ago we had CPUs from Intel and AMD and GPUs from ATI and NVidia. There was even another CPU-makers VIA, and GPU-makers S3 and Matrox. Things are different now. Below I want to shortly discuss the most noticeable processors from each of the big three.
Tesla K80 (Kepler)
Radeon Nano and FirePro S9300X2 (Fiji)
XeonPhi Knights landing
During the panel discussion some very interesting questions were asked, I’d like to share with you.

The coming month we’re travelling every week. This generates are a lot of opportunities where you can meet the StreamHPC team! For appointments, send an email to 






If there would be one rule to get the best performance, then it’s avoiding data-transfers. Therefore it’s important to have lots of bandwidth and GFLOPS per processor, and not simply add up those numbers. Everybody who has worked with MPI, knows why: transferring data between processors can totally kill the performance. So the more is packed in one chip, the better the results.




At the university of Newcastle they use OpenCL for researching the performance balance between software and hardware. This resource management isn’t limited to shared memory systems, but extends to mixed architectures where batches of co-processors and other resources make it much more complex problem to solve. They chose OpenCL as it gives both inter-node and intra-node resource-management.