# Parallel Programming and High-Performance Computing - PDF

Please download to get full document.

View again

All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Information Report
Category:

## Psychology

Published:

Views: 6 | Pages: 51

Extension: PDF | Download: 0

Share
Related documents
Description
Parallel Programming and High-Performance Computing Part 7: Examples of Parallel Algorithms Dr. Ralf-Peter Mundani CeSIM / IGSSE / CiE Technische Universität München Overview matrix operations JACOBI and
Transcript
Parallel Programming and High-Performance Computing Part 7: Examples of Parallel Algorithms Dr. Ralf-Peter Mundani CeSIM / IGSSE / CiE Technische Universität München Overview matrix operations JACOBI and GAUSS-SEIDEL iterations sorting Everything that can be invented has been invented. Charles H. Duell commissioner U.S. Office of Patents, 99 7 Matrix Operations reminder: matrix underlying basis for many scientific problems is a matrix stored as -dimensional array of numbers (integer, float, double) row-wise in memory (typical case) column-wise in memory typical matrix operations (K: set of numbers) ) A + B = C with A, B, and C K N M ) Ab = c with A K N M, b K M, and c K N ) AB = C with A K N M, B K M L, and C K N L matrix-vector multiplication () and matrix multiplication () are main building blocks of numerical algorithms both pretty easy to implement as sequential code what happens in parallel? 7 Matrix Operations matrix-vector multiplication appearances systems of linear equations (SLE) Ax = b iterative methods for solving SLEs (conjugate gradient, e. g.) implementation of neural networks (determination of output values, training neural networks) standard sequential algorithm for A K N N and b, c K N for i to N do c[i] 0; for j to N do c[i] c[i] + A[i][j]*b[j]; od od for full matrix A this algorithm has a complexity of Ο(N ) 7 4 Matrix Operations matrix-vector multiplication (cont d) in parallel, there are three main options to distribute data among P procs row-wise block-striped decomposition: each process is responsible for a contiguous part of about N/P rows of A column-wise block-striped decomposition: each process is responsible for a contiguous part of about N/P columns of A checkerboard block decomposition: each process is responsible for a contiguous block of matrix elements vector b may be either replicated or block-decomposed itself row-wise column-wise checkerboard 7 5 7 6 Matrix Operations matrix-vector multiplication (cont d) row-wise block-striped decomposition probably the most straightforward approach each process gets some rows of A and entire vector b each process computes some components of vector c build and replicate entire vector c (gather-to-all, e. g.) complexity of Ο(N /P) multiplications / additions for P processes = 7 7 Matrix Operations matrix-vector multiplication (cont d) column-wise block-striped decomposition less straightforward approach each process gets some columns of A and respective elements of vector b each process computes partial results of vector c build and replicate entire vector c (all-reduce or maybe a reduce-scatter if processes do not need entire vector c) complexity is comparable to row-wise approach = o o o o o o 7 Matrix Operations matrix-vector multiplication (cont d) checkerboard block decomposition each process gets some block of elements of A and respective elements of vector b each process computes some partial results of vector c build and replicate entire vector c (all-reduce, but unused elements of vector c have to be initialised with zero) complexity of the same order as before; it can be shown that checkerboard approach has slightly better scalability properties (increasing P does not require to increase N, too) = o o o Matrix Operations matrix multiplication appearances computational chemistry (computing changes of state, e. g.) signal processing (DFT, e. g.) standard sequential algorithm for A, B, C K N N for i to N do for j to N do C[i][j] 0 for k to N do C[i][j] C[i][j] + A[i][k]*B[k][j]; od od od for full matrices A and B this algorithm has a complexity of Ο(N ) 7 9 7 0 Matrix Operations matrix multiplication (cont d) naïve parallelisation each process gets some rows of A and entire matrix B each process computes some rows of C problem: once N reaches a certain size, matrix B won t fit completely into cache and / or memory performance will dramatically decrease remedy: subdivision of matrix B instead of whole matrix B = Matrix Operations matrix multiplication (cont d) recursive algorithm algorithm follows the divide-and-conquer principle subdivide both matrices A and B into four smaller submatrices A00 A0 B00 B0 A = B = A0 A B0 B hence, the matrix multiplication can be computed as follows C = A A 00 0 B B A + A 0 B B 0 0 A if blocks are still too large for the cache, repeat this step (i. e. recursively subdivide) until it fits furthermore, this method has significant potential for parallelisation (especially for MemMS) A 00 0 B B A A 0 B B 7 7 Matrix Operations matrix multiplication (cont d) CANNON s algorithm each process gets some rows of A and some columns of B each process computes some components of matrix C different possibilities for assembling the result gather all data, build and (maybe) replicate matrix C pump data onward to next process ( systolic array) complexity of Ο(N /P) multiplications / additions for P processes = Overview matrix operations JACOBI and GAUSS-SEIDEL iterations sorting 7 JACOBI and GAUSS-SEIDEL Iterations scenario solve an elliptic partial differential equation (PDE) with DIRICHLET boundary conditions on a given domain Ω simple example: POISSON equation Δu = f u( x, y) u( x, y) () Δu ( x, y) = = f( x, y) for (x,y) Ω x y on the unit square Ω =]0,[ with u given on the boundary of Ω task: u(x,y) or an approximation to it has to be found occurrences of such PDEs fixed membrane stationary heat equation (picture) electrostatic fields 7 4 7 5 JACOBI and GAUSS-SEIDEL Iterations discretisation for solving our example PDE, a discretisation is necessary typical discretisation techniques finite differences finite volumes finite elements finite difference discretisation (forward and backward differences) for the second derivatives in () for a mesh width h leads to ), ( ), ( ), ( ), ( h y h x u y x u y h x u x y x u + + ), ( ), ( ), ( ), ( h h y x u y x u h y x u y y x u + + () JACOBI and GAUSS-SEIDEL Iterations discretisation (cont d) for computational solution of our problem a D grid is necessary equidistant grid with (N+) (N+) grid points, N = /h u i,j u(ih, jh) with i, j = 0,, N () u( x, y) x hence, () can be written as u ui-,j u iji, j + + h,, u( x, y) ui, j- ui,j + ui,j+ y h with () and an appropriate discretisation of f(x,y) follows (4) u i,j u i,j + 4u i,j u i+,j u i,j+ = h f i,j 0 i, j N resulting equation on the boundary u i,j = g i,j i, j = 0 or i, j = N inner point boundary point 7 6 JACOBI and GAUSS-SEIDEL Iterations system of linear equations for each inner point there is one linear equation according to (4) equations in points next to the boundary (i. e. i, j = or i, j = N ) access boundary values g i,j these are shifted to the right-hand side of the equation hence, all unknowns are located left, all known quantities right assembling of the overall vector of unknowns by lexicographic row-wise ordering j (y) 5-point difference star (i,j+),,, (i,j),,, (i,j) (i+,j),,, i (x) (i,j ) 7 7 7 JACOBI and GAUSS-SEIDEL Iterations system of linear equations (cont d) this results to a system of linear equations Ax = b with (N ) equations in (N ) unknowns matrix A has block-tridiagonal structure and is sparse (5) =,N N,, N,,N N,, N, f f f f u u u u M M M M M M O O O O O O O O = A = x = b JACOBI and GAUSS-SEIDEL Iterations solving large sparse SLEs schoolbook method for solving SLEs: GAUSSian elimination direct solver that provides the exact solution has a complexity of Ο(M ) for M unknowns (!) does not exploit sparsity of matrix A that is even filled-up (i. e. existing zeros are destroyed ) during solution hence, using some iterative method instead approximates the exact solution has a complexity of Ο(M) operations for a single iteration typically much less than Ο(M ) iteration steps needed ideal case of Ο() steps for multigrid or multilevel methods basic methods (number of steps depending on M) relaxation methods: JACOBI, GAUSS-SEIDEL, SOR minimisation methods: steepest descent, CG 7 9 JACOBI and GAUSS-SEIDEL Iterations JACOBI iteration decompose matrix A in its diagonal part D A, its upper triangular part U A, and its lower triangular part L A starting with A = L A + D A + U A b = Ax = D A x + (L A + U A )x and writing b = D A x (T+) + (L A + U A )x (T) with x (T) denoting the approximation to x after T steps of the iteration leads to the following iterative scheme x (T+) = D A (L A + U A )x (T) + D A b = x (T) + D A r (T) where the residual is defined as r (T) = b Ax (T) 7 0 JACOBI and GAUSS-SEIDEL Iterations JACOBI iteration (cont d) algorithmic form of the JACOBI iteration for T 0,,, do for k to M do x k (T+) /A k,k *(b k j k A k,j *x j (T) ) od od for our example with matrix A according to (5) this means for T 0,,, do for j to N do for i to N do u i,j (T+) ¼*(u i,j (T) + u i,j (T) + u i+,j (T) + u i,j+ (T) h *f i,j ) od od od 7 JACOBI and GAUSS-SEIDEL Iterations GAUSS-SEIDEL iteration same decomposition of matrix A as for JACOBI starting with A = L A + D A + U A b = Ax = (D A + L A )x + U A x and writing b = (D A + L A )x (T+) + U A x (T) leads to the following iterative scheme x (T+) = (D A + L A ) U A x (T) + (D A + L A ) b = x (T) + (D A + L A ) r (T) where the residual is defined as r (T) = b Ax (T) 7 JACOBI and GAUSS-SEIDEL Iterations GAUSS-SEIDEL iteration (cont d) algorithmic form of the GAUSS-SEIDEL iteration for T 0,,, do for k to M do x k (T+) /A k,k *(b k j k A k,j *x j (T+) j k A k,j *x j (T) ) od od for our example with matrix A according to (5) this means for T 0,,, do for j to N do for i to N do u i,j (T+) ¼*(u i,j (T+) + u i,j (T+) + u i+,j (T) + u i,j+ (T) h *f i,j ) od od od 7 JACOBI and GAUSS-SEIDEL Iterations parallelisation of JACOBI iteration neither JACOBI nor GAUSS-SEIDEL are used today very frequently for solving large SLEs (they are too slow) nevertheless, the algorithmic aspects are still of interest a parallel JACOBI is quite straightforward in iteration step T only values from step T are used hence, all updates of one step can be made in parallel furthermore, subdivide the domain into strips or blocks, e. g. j step T P P P j i P 4 P 6 step T i P 7 P P 9 7 4 JACOBI and GAUSS-SEIDEL Iterations parallelisation of JACOBI iteration (cont d) for its computations, each processor P needs a subset of boundary values (if P is adjacent to the boundary) one row / column of values from P s neighbouring processes a global / local termination condition hence, each processor has to execute the following algorithm ) update all local approximate values u (T) i,j to u (T+) i,j ) send all updates ( ) to the respective neighbouring processes ) receive all necessary updates ( ) from neighbouring processes 4) compute local residual values and perform a reduce-all for global residual 5) continue if global residual is larger than some threshold value ε 7 5 JACOBI and GAUSS-SEIDEL Iterations parallelisation of GAUSS-SEIDEL iteration problem: since the updated values are immediately used where available, parallelisation seems to be quite complicated j updated values (step T) old values (step T ) i hence, a different order of visiting / updating the grid points is necessary wavefront ordering red-black or checkerboard ordering 7 6 JACOBI and GAUSS-SEIDEL Iterations parallelisation of GAUSS-SEIDEL iteration (cont d) wavefront ordering () diagonal ordering of updates all values along a diagonal line can be updated in parallel; single diagonal lines still have to be processed sequentially problem: for P = N processors there are P updates that need P sequential steps speed-up restricted to P/ P : P : P : P 4 : P 5 : P 6 : 7 7 JACOBI and GAUSS-SEIDEL Iterations parallelisation of GAUSS-SEIDEL iteration (cont d) wavefront ordering () better row-wise decomposition of matrix A in K blocks of N/K rows for P = N/K processors there are K sequential blocks of KP updates that need KP+P sequential steps each hence, speed-up restricted to KP/(K+) P : P : P : here, K = speed-up S(p) = P/ 7 JACOBI and GAUSS-SEIDEL Iterations parallelisation of GAUSS-SEIDEL iteration (cont d) red-black or checkerboard ordering grid points get a checkerboard colouring of red and black lexicographic order of visiting / updating the grid points first the red ones, than the black ones hence, no dependencies within red nor within black set subdivide grid such that each processor gets some red and some black points two sequential steps necessary, but perfect parallelism within each of them red-black ordering 5-point star for red (left) and black (right) grid points 7 9 Overview matrix operations JACOBI and GAUSS-SEIDEL iterations sorting 7 0 Sorting reminder: sorting one of the most common operations performed by computers let A = a, a,, a N be a sequence of N elements in arbitrary order sorting transforms A into a monotonically increasing or decreasing sequence Ã = ã, ã,, ã N such that ã i ã j ã i ã j for i j N (increasing order) for i j N (decreasing order) and Ã being a permutation of A in general, sorting algorithms are comparison-based, i. e. an algorithm sorts an unordered sequence of elements by repeatedly comparing / exchanging pairs of elements lower bound of the sequential complexity of any comparison-based algorithm is Ο(Nlog N) 7 Sorting basic operations in sequential / parallel sorting algorithms, some basic operations are repeatedly executed compare-exchange: elements a i and a j are compared and exchanged in case they are out of sequence compare-split: already sorted blocks of elements A i and A j stored at different processors P i and P j, resp., are merged and split in the following manner P i 7 P i P i 7 0 P i 0 5 P j P j P j 7 0 P j 7 0 P i sends block A i to P j and vice versa merge split 7 Sorting bubble sort simple comparison-based sorting algorithm of complexity Ο(N ) standard sequential algorithm for sorting sequence A for (i = N-; i ; --i) { for (j = 0; j i; ++j) { compare-exchange (a[j], a[j+]); } } example: iterations i =,, for sorting A =,,,, 5, 6, 4, initial setup st iteration nd iteration rd iteration Sorting bubble sort (cont d) standard algorithm not very suitable for parallelisation partition of A into blocks of size N/P elements (for P processors) still to be processed sequentially hence, different approach necessary: odd-even transposition idea: sorting N elements (N is even) in N phases, each of which requires N/ compare-exchange operations alternation between two phases, called odd and even phase during odd phase, only elements with odd indices are compareexchanged with their right neighbours, thus, the pairs (a, a ), (a, a 4 ), (a 5, a 6 ),, (a N-, a N ) during even phase, only elements with even indices are compareexchanged with their right neighbours, thus, the pairs (a, a ), (a 4, a 5 ), (a 6, a 7 ),, (a N-, a N- ) after N phases of odd-even-exchanges, sequence A is sorted 7 4 7 5 Sorting bubble sort (cont d) example: odd-even-transposition for sorting A from before phase (odd) phase (even) phase (odd) phase 4 (even) phase 5 (odd) phase 6 (even) phase 7 (odd) phase (even) 6 5 4 Sorting bubble sort (cont d) parallelisation of odd-even-transposition each process is assigned a block of N/P elements, which are sorted internally (using merge sort or quicksort, e. g.) with a complexity of Ο((N/P)log (N/P)) afterwards, each processor executes P phases (P/ odd and P/ even ones), performing compare-split operations at the end of these phases, sequence A is sorted (and distributed stored over P processes where process P i holds block A i with A i A j for i j) during each phase Ο(N/P) comparisons are performed, thus, the total complexity of the parallel sort can be computed as Ο((N/P)log (N/P)) + Ο(N) + communication local sort comparisons 7 6 Sorting merge sort comparison-based sorting algorithm of complexity Ο(Nlog N) based on the divide-and-conquer strategy basic idea: construct a sorted list by merging two sorted lists ) divide unsorted list into two sublists of about half the size ) recursively divide sublists until list size equals one ) merge the two sorted sublists back into one sorted list function mergesort (list L) { if (size of L == ) return L; else divide L into left and right list; left = mergesort (left); right = mergesort (right); result = merge (left, right); return result; } 7 7 Sorting merge sort (cont d) example: merge sort for sequence A =,, 5,,, 6, 4, split merge Sorting merge sort (cont d) parallelisation of merge sort: naïve approach construct a binary processing tree of L leaf nodes and assign P processors, P L, to tree nodes (i. e. inner and leaf nodes) divide sequence A into blocks A i of size N/L and store them at leaf nodes parallel sort of blocks A i via sequential merge sort with a complexity of Ο((N/L)log (N/L)) repeatedly parallel merge of sorted sublists from child nodes and sending result upstream to parent node with a total complexity of Ο(N) (costs of merge operation at the different tree levels are N + N/ + N/4 + +N/L comparisons) problem: sending of sublists might induce heavy communication hence, an appropriate mapping of processors to tree nodes is indispensable 7 9 Sorting merge sort (cont d) example: mapping of processors to tree nodes for P = L = merge merge with communication P P P 5 P P P 5 P 7 P P P P 4 P 5 P 6 P 7 P observations: not to expect very good speed-up values for parallel merge of sublists (amount of used processors is halved in every step part : parallel index and estimate of MINSKY) hence, different strategy for parallel merge necessary 7 40 Sorting sorting networks sorting networks are based on a comparison network model, that sort N elements in significantly smaller than Ο(Nlog N) operations key component of a sorting network: comparator device with two inputs a, a and two outputs ã, ã increasing comparator: ã = min{a, a } and ã = max{a, a } decreasing comparator: ã = max{a, a } and ã = min{a, a } sorting networks consist of several columns of such comparators, where each column performs a permutation, thus the final column is sorted in increasing / decreasing order ( permutation networks) ã = min{a, a } ã = max{a, a } a a a ã = max{a, a } a ã = min{a, a } 7 4 Sorting bitonic sort a bitonic sorting network sorts N elements in Ο(log N) operations key task: rearrangement of a bitonic sequence into a sorted one definition: bitonic sequence A sequence S = a 0, a,, a N is bitonic iff ) there exists an index i, 0 i N, such that a 0,, a i is monotonically increasing and a i+,, a N is monotonically decreasing, or ) there exists a cyclic shift of indices so that () is satisfied. example,, 4, 7, 6, 0 first increases and then decreases, 9,,, 0, 4 can be cyclic shifted to 0, 4,, 9,, 7 4 Sorting bitonic sort (cont d) let S = a 0, a,, a N be a bitonic sequence such that a 0 a a N/ and a N/ a N/+ a N consider the following subsequences of S S = min{a 0, a N/ }, min{a, a N/+ },, min{a N/, a N } S = max{a 0, a N/ }, max{a, a N/+ },, max{a N/, a N } in sequence S, there is an element s i = min{a i, a N/+i } such that all elements before s i are from the increasing part of S and all elements after s i are from the decreasing part of S also, in sequence S, there is an element ŝ i = max{a i, a N/+i } such that all elements before ŝ i are from the decreasing part of S and all elements after ŝ i are from the increasing part of S hence, sequences S and S are bitonic sequences 7 4 Sorting bitonic sort (cont d) furthermore, S S because s i is greater than or equal to all elements of S, ŝ i is less than or equal to all elements of S, and ŝ i is greater than or equal s i hence, the initial problem of rearranging a bitonic sequence of size N was reduced to that of rearranging two smaller bitonic sequences of size N/ and concatenating the results this operation is further referred to as bitonic split (although assuming S and S had increasing / decreasing sequences of the same length, the bitonic split operation holds for any bitonic sequence) the recursive usage of the bitonic split operation until all obtained subsequences are of size o
Recommended

71 pages

29 pages

71 pages

52 pages

13 pages

20 pages

8 pages

15 pages

49 pages

36 pages

16 pages

27 pages

5 pages

53 pages

### Parallel Programming and High-Performance Computing

74 pages

View more...

#### High Performance Computing. What is it used for and why? - PDF

We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks