StreamHPC https://streamhpc.com/blog/20130603/theapplicationareasopenclandcudacanbeused/ Export date: Sat May 25 4:48:47 2019 / +0000 GMT 
The 13 application areas where OpenCL and CUDA can be used

Which algorithms map best to which accelerator? In other words: what kinds of algorithms are faster when using accelerators with OpenCL or CUDA? Professor Wu Feng and his group ^{1} from Virginia Tech took a close look at which types of algorithms are a good fit for vector processors. This resulted in a document: "The 13 (computational) dwarfs of OpenCL" (2011). It became an important document here at StreamHPC, as it gives a good starting point for investigating new problem spaces. The document is inspired by Phil Colella, who identified seven numerical methods that are important for science and engineering. He called these algorithmic methods "dwarfs". With six more application areas in which GPUs and other vector-accelerated processors do well, the list was completed. As a fun side note: in the Brothers Grimm's "Snow White" there were 7 dwarfs, and in Tolkien's "The Hobbit" there were 13.

The computational dwarfs of OpenCL and CUDA

Each part below has a description of the "computational dwarf", examples of application areas and some words from the OpenCL perspective. It is not intended to be complete; it is rather a starting point. You will notice some overlap, which is normal if you take into account that some algorithms have aspects of two or more dwarfs. This also implies that some problems have more than one solution.

Dense Linear Algebra

The classic vector and matrix operations, traditionally divided into Level 1 (vector/vector), Level 2 (matrix/vector) and Level 3 (matrix/matrix) operations. Applications are various:
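As a minimal illustration of the Level-3 pattern (a plain-C sketch rather than an OpenCL kernel; the function name and layout are ours), a naive dense matrix multiply. Every output element is independent, which is exactly what makes this loop nest easy to map onto a grid of work-items:

```c
#include <stddef.h>

/* Naive Level-3 dense multiply: C = A * B, all row-major n x n matrices.
   Each (row, col) output element is independent of the others, so in
   OpenCL one work-item would typically compute one element of C. */
void matmul(const float *a, const float *b, float *c, size_t n) {
    for (size_t row = 0; row < n; ++row) {
        for (size_t col = 0; col < n; ++col) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += a[row * n + k] * b[k * n + col];
            c[row * n + col] = acc;
        }
    }
}
```

Tuned implementations additionally tile the matrices into local memory; libraries such as clBLAS and cuBLAS do this for you.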
Normally implemented by loops, and in many cases easy to parallelize in OpenCL.

Sparse Linear Algebra

Multiplication involving matrices composed primarily of zeroes. Computations can be done much more efficiently by reordering the matrix so that the nonzero elements cluster around the diagonal. Applications:
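A common way to store only the nonzeros is the CSR (compressed sparse row) format; the sketch below (plain C, our naming) shows a sparse matrix-vector multiply, the core operation of the iterative solvers mentioned next:

```c
#include <stddef.h>

/* Sparse matrix-vector multiply y = A * x with A in CSR form:
   row_ptr[r]..row_ptr[r+1] indexes the nonzeros of row r, whose
   column positions are in col_idx and values in vals. Only the
   nonzeros are touched; in OpenCL one work-item handles one row. */
void spmv_csr(size_t rows, const size_t *row_ptr, const size_t *col_idx,
              const float *vals, const float *x, float *y) {
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (size_t i = row_ptr[r]; i < row_ptr[r + 1]; ++i)
            acc += vals[i] * x[col_idx[i]];
        y[r] = acc;
    }
}
```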
There are two methods in OpenCL. The first is to solve the problem directly with a series of matrix operations, which results in a large overhead. The second is to use a series of successive approximations, minimising the error function.

Spectral Methods

Solving certain differential equations, often involving the use of the Fast Fourier Transform. Spectral methods can be used to solve ordinary differential equations (ODEs), partial differential equations (PDEs) and eigenvalue problems involving differential equations. Applications:
There are various FFT implementation in OpenCL for each hardware architecture. The trick is the tuning. NBody MethodsAn Nbody simulation is a simulation of a dynamical system of particles, usually under the influence of physical forces, such as gravity. Computations can be done both ways (A influences B the same B influences A) and the state of the whole system is updated after each round. The basic algorithm is O(N^2). Optimizations for larger systems are possible by neighbouradministration and leaving faraway particles out of the computation. Runtime approachselection is desirable. Applications:
OpenCL implementations can do tens of rounds per second with millions of particles, outperforming scalar implementations with ease.

Structured Grids

In a structured or regular grid, all the elements in the grid have the same dimensions. Think squares and blocks. Computations depend on the values of neighbouring elements in the regular grid. Applications:
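The classic structured-grid kernel is a stencil; a minimal plain-C sketch (our naming) of one Jacobi relaxation step, where each cell is replaced by the average of its four neighbours:

```c
/* One Jacobi iteration on a w x h regular grid: every interior
   cell of `out` becomes the average of its four neighbours in `in`.
   The regular layout maps one-to-one onto a 2D OpenCL NDRange. */
void jacobi_step(const float *in, float *out, int w, int h) {
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x)
            out[y * w + x] = 0.25f * (in[y * w + x - 1] + in[y * w + x + 1] +
                                      in[(y - 1) * w + x] + in[(y + 1) * w + x]);
}
```

The local-memory tiling of such stencils is one of the most-studied OpenCL optimizations.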
In OpenCL the grids (of work-groups) are regular too, so mapping is quite easy. The problem remaining to be solved is how to do the communication between neighbors.

Unstructured Grids

All grids that are not regular grids. Different elements have different numbers of neighbors. This group has a lot of overlap with backtracking. Applications:
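Because the neighbour count varies per element, the connectivity must be stored explicitly; a plain-C sketch (our naming, CSR-like layout) of a neighbour-averaging pass over such a mesh:

```c
/* Neighbour averaging on an unstructured mesh. Element e's
   neighbours are nbr_idx[nbr_ptr[e] .. nbr_ptr[e+1]-1]; their count
   varies per element. The indirect loads are what make mapping
   this onto GPU hardware hard. */
void relax_unstructured(int n, const int *nbr_ptr, const int *nbr_idx,
                        const float *in, float *out) {
    for (int e = 0; e < n; ++e) {
        int count = nbr_ptr[e + 1] - nbr_ptr[e];
        float acc = 0.0f;
        for (int i = nbr_ptr[e]; i < nbr_ptr[e + 1]; ++i)
            acc += in[nbr_idx[i]];
        out[e] = count ? acc / (float)count : in[e];
    }
}
```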
The difficulty here is mapping the irregular grid onto the hardware.

MapReduce & Monte Carlo

Each process runs completely independently of the others, so almost no communication is required between processes. In the case of huge datasets and compute-intensive algorithms, GPUs can be used in combination with Big Data solutions like Hadoop. Applications:
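The textbook example is a Monte Carlo estimate of pi: every sample is independent, so samples can be spread over thousands of GPU threads with no communication until a final reduction. A plain-C sketch (our naming; a simple 64-bit LCG stands in for a per-thread random stream):

```c
/* Monte Carlo estimate of pi: draw random points in the unit
   square and count how many fall inside the quarter circle.
   hits / samples approximates pi / 4. Every sample is fully
   independent -- ideal for massive parallelism. */
double monte_carlo_pi(unsigned long long samples, unsigned long long seed) {
    unsigned long long state = seed, hits = 0;
    for (unsigned long long i = 0; i < samples; ++i) {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        double x = (double)(state >> 11) / 9007199254740992.0;  /* 2^53 */
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        double y = (double)(state >> 11) / 9007199254740992.0;
        if (x * x + y * y <= 1.0) ++hits;
    }
    return 4.0 * (double)hits / (double)samples;
}
```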
As communication between the nodes is minimal, this is one of the fastest methods on GPUs.

Combinational Logic

These algorithms generally involve performing simple operations on very large amounts of data, exploiting bit-level operations. Applications:
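A representative bit-level kernel is the population count (number of set bits), which shows up in cryptography, error correction and similarity measures. OpenCL C exposes it as the built-in popcount; a portable plain-C equivalent:

```c
/* Population count: number of set bits in a 32-bit word, using the
   classic parallel bit-summing trick. OpenCL C has this as the
   built-in popcount(); many GPUs execute it as one instruction. */
unsigned popcount32(unsigned v) {
    v = v - ((v >> 1) & 0x55555555u);               /* 2-bit sums  */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* 4-bit sums  */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;               /* 8-bit sums  */
    return (v * 0x01010101u) >> 24;                 /* fold to top byte */
}
```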
Not all hardware is fit for these types of operations, so device selection is critical.

Graph Traversal

Graph traversal is the problem of visiting all the nodes in a graph in a particular manner, updating and/or checking their values along the way. Tree traversal is a special case of graph traversal. It involves indirect lookups and little computation. Applications:
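A breadth-first search shows the character of this dwarf: many indirect lookups, almost no arithmetic. A plain-C sketch (our naming); GPU versions process each frontier level in parallel, one work-item per frontier vertex:

```c
#include <stdlib.h>

/* Breadth-first search over an adjacency list (CSR-like layout).
   dist[v] becomes the hop count from src, or -1 if unreachable.
   Note how little computation there is per memory access. */
void bfs(int n, const int *adj_ptr, const int *adj, int src, int *dist) {
    int *queue = malloc((size_t)n * sizeof(int));
    int head = 0, tail = 0;
    for (int i = 0; i < n; ++i) dist[i] = -1;
    dist[src] = 0;
    queue[tail++] = src;
    while (head < tail) {
        int v = queue[head++];
        for (int i = adj_ptr[v]; i < adj_ptr[v + 1]; ++i)
            if (dist[adj[i]] < 0) {
                dist[adj[i]] = dist[v] + 1;
                queue[tail++] = adj[i];
            }
    }
    free(queue);
}
```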
In OpenCL the trick is to keep the kernels busy enough.

Dynamic Programming

This is an algorithmic technique that computes solutions by solving simpler overlapping subproblems. Many dynamic programming problems operate by filling in a grid that represents the solution space, where one location in the grid holds the final answer. Applications:
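The Levenshtein edit distance is a compact example of grid filling; a plain-C sketch (our naming) that keeps only two rows of the grid. On a GPU the grid is typically filled along anti-diagonals, since all cells on one anti-diagonal are independent:

```c
#include <stdlib.h>
#include <string.h>

/* Levenshtein edit distance between strings a and b: the DP grid is
   filled row by row from overlapping subproblems, and the answer is
   the final cell. Only two rows are kept in memory at a time. */
int edit_distance(const char *a, const char *b) {
    size_t la = strlen(a), lb = strlen(b);
    int *prev = malloc((lb + 1) * sizeof(int));
    int *cur = malloc((lb + 1) * sizeof(int));
    for (size_t j = 0; j <= lb; ++j) prev[j] = (int)j;
    for (size_t i = 1; i <= la; ++i) {
        cur[0] = (int)i;
        for (size_t j = 1; j <= lb; ++j) {
            int sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
            int del = prev[j] + 1;
            int ins = cur[j - 1] + 1;
            int best = sub < del ? sub : del;
            cur[j] = best < ins ? best : ins;
        }
        int *tmp = prev; prev = cur; cur = tmp;
    }
    int d = prev[lb];
    free(prev);
    free(cur);
    return d;
}
```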
As the word "dynamic" implies, better performance is reached when the algorithm is tuned at run time.

Backtracking

This consists of building up all possible solutions and eliminating the invalid ones (often in a single step), as there is no overview of all possible solutions when starting. It is effectively a depth-first search of a problem space and therefore a special case of dynamic programming, but with a strong accent on the stepwise creation of the solution space. The generic solution for this group of algorithms is branch-and-bound (divide and conquer!). Applications:
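The N-queens puzzle is a standard backtracking example; a plain-C sketch (our naming, bitmask pruning) that counts solutions. Each distinct first-row placement is an independent subtree, which is how the search is distributed over parallel threads:

```c
/* Count N-queens solutions by depth-first backtracking: place one
   queen per row and prune any partial placement whose column or
   diagonal is already attacked (tracked as bitmasks). Call with
   row = 0 and all masks 0. Valid for n up to about 16. */
int queens(int n, int row, unsigned cols, unsigned diag1, unsigned diag2) {
    if (row == n) return 1;  /* all rows filled: one complete solution */
    int count = 0;
    for (int c = 0; c < n; ++c) {
        unsigned col_bit = 1u << c;
        unsigned d1_bit = 1u << (row + c);          /* "/" diagonal  */
        unsigned d2_bit = 1u << (row - c + n - 1);  /* "\" diagonal  */
        if ((cols & col_bit) || (diag1 & d1_bit) || (diag2 & d2_bit))
            continue;  /* conflict: prune this branch immediately */
        count += queens(n, row + 1, cols | col_bit,
                        diag1 | d1_bit, diag2 | d2_bit);
    }
    return count;
}
```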
If the problem space is expected to be evenly divided, a limited number of starting positions is created to walk the problem space in parallel. In OpenCL it is important to avoid large branches.

Probabilistic Graphical Models

A graph that combines uncertainty (probabilities) and logical structure (independence constraints) to compactly represent complex, real-world phenomena. Applications:
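The smallest possible instance is marginalising one binary variable out of a two-variable model; a plain-C sketch (our naming and layout). Real models chain many such local updates across the graph, and it is the many threads updating shared nodes that makes atomics the bottleneck:

```c
/* Exact marginal in a two-variable model over binary A and B:
   P(b) = sum over a of P(a) * P(b|a).
   p_b_given_a is a flat 2x2 table indexed as [a * 2 + b]. */
void marginal(const double *p_a, const double *p_b_given_a, double *p_b) {
    for (int b = 0; b < 2; ++b) {
        p_b[b] = 0.0;
        for (int a = 0; a < 2; ++a)
            p_b[b] += p_a[a] * p_b_given_a[a * 2 + b];
    }
}
```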
When multiple processes need to update the same node (a typical case for atomics), a lot of time is consumed here.

Finite State Machines

A mathematical model of computation used to design both computer programs and sequential logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states. Applications:
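A table-driven machine makes the model concrete; a plain-C sketch (our naming) of a two-state FSM that checks whether a binary string contains an even number of '1' characters. The whole computation is a chain of table lookups, state = table[state][input]:

```c
/* Two-state FSM: state 0 = even number of '1's seen so far,
   state 1 = odd. Each input character triggers one transition
   via the table; returns 1 if the count of '1's is even. */
int even_ones(const char *s) {
    static const int table[2][2] = {
        {0, 1},  /* from EVEN: '0' -> EVEN, '1' -> ODD  */
        {1, 0},  /* from ODD:  '0' -> ODD,  '1' -> EVEN */
    };
    int state = 0;  /* start in EVEN */
    for (; *s; ++s)
        state = table[state][*s == '1'];
    return state == 0;
}
```

The serial dependence between transitions is what makes FSMs a poor fit for GPUs unless many independent input streams are processed at once.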
Why is categorizing important?

Understanding which types of applications perform well makes it easier to decide when to use GPUs and other accelerators, and when to use CPUs or FPGAs. As a second step, the right hardware can be selected more easily once you know the right job category. So don't just buy a Tesla when you want to start with GPGPU, like others have done; for example, combinational logic is much faster on AMD GPUs. Not all of the above algorithms work best on a GPU by definition: OpenCL on a CPU can be a good choice when memory bandwidth is not the bottleneck (a CPU does around 30 GB/s, a GPU up to around 300 GB/s). Along the same line, the maximum attainable performance is easier to compare within a group of algorithms: a 4 TFLOPS accelerator can drop to 100 GFLOPS when it hits a wall, and grouping makes it possible to compare apples with apples.

Not being able to map your algorithm to one of the above groups might mean that it will be hard to program in OpenCL or CUDA. Once it is mapped to a group, optimizations for similar algorithms might also apply to yours, or might at least be worth investigating.

We hope you have gained some valuable insights and can take the next steps in speeding up your software! Remember you can always contact us ^{2} with comments and questions, or ask for our help in making your software optimal. Want to read offline or share? Download the brochure ^{3}.
Links:

Post date: 2013-06-03 19:21:32 | Post date GMT: 2013-06-03 17:21:32 | Post modified date: 2017-10-24 11:56:49 | Post modified date GMT: 2017-10-24 09:56:49