Principles of Parallel Algorithm Design: Concurrency and Mapping

• Introduction to parallel algorithms
—tasks and decomposition
—processes and mapping
—processes versus processors
• Decomposition techniques - part 1
—recursive decomposition
—data decomposition

Exploratory Decomposition Example
Solving a 15-puzzle
• A sequence of three moves takes state (a) to the final state (d)
• From an arbitrary state, we must search for a solution

Exploratory Decomposition Example
Solving a 15-puzzle by search:
—generate the successor states of the current state
—explore each successor as an independent task
[Figure: the initial state, the states reachable after the first move, and the final (solution) state]

Exploratory Decomposition Speedup
• The parallel formulation may perform a different amount of work than the serial one
• This can cause super- or sub-linear speedup
[Figure: two search trees with branches of length m]
—superlinear case: total serial work = 2m + 1, while total parallel work = 4
—sublinear case: total serial work = m, while total parallel work = 4m

Hybrid Decomposition
Use multiple decomposition strategies together; this is often necessary for adequate concurrency.
• Quicksort
—recursive decomposition alone limits concurrency (why? the first partitioning step is a single serial task, so concurrency grows only as the recursion deepens)
—augmenting recursive decomposition with data decomposition is better
– e.g., use data decomposition on the input data to compute a split
• Discrete event simulation
—data parallelism may be possible when processing a task

Topics for Today
• Decomposition techniques - part 2
—exploratory decomposition
—hybrid decomposition
• Characteristics of tasks and interactions
—task generation, granularity, and context
—characteristics of task interactions
• Mapping techniques for load balancing
—static mappings
—dynamic mappings
• Methods for minimizing interaction overheads
• Parallel algorithm design templates

Characteristics of Tasks
• Key characteristics
—generation strategy
—associated work
—associated data size
• These characteristics affect the choice and performance of parallel algorithms

Size of Data Associated with Tasks
• The data may be small or large relative to the computation
—size(input) < size(computation), e.g., the 15-puzzle
—size(input) = size(computation) > size(output), e.g., computing a minimum
—size(input) = size(output) < size(computation), e.g., sorting
• Implications
—small data: the task can easily migrate to another process
—large data: ties the task to a process
– it may be possible to avoid communicating the task context by reconstructing or recomputing it elsewhere

Characteristics of Task Interactions
Four orthogonal classification criteria:
• Static vs. dynamic
• Regular vs. irregular
• Read-only vs. read-write
• One-sided vs. two-sided

Characteristics of Task Interactions
• Static interactions
—tasks and interactions are known a priori
—simpler to code
• Dynamic interactions
—the timing or the set of interacting tasks cannot be determined a priori
—harder to code
– especially using two-sided message-passing APIs

Static Irregular Task Interaction Pattern
Example: sparse matrix-vector multiply

Characteristics of Task Interactions
• Read-only interactions
—tasks only read data associated with other tasks
• Read-write interactions
—tasks read and modify data associated with other tasks
—harder to code: they require synchronization
– must avoid read-write and write-write ordering races

Characteristics of Task Interactions
• One-sided
—initiated and completed independently by one of the two interacting tasks
– GET
– PUT
• Two-sided
—both tasks coordinate in the interaction
– SEND + RECV
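The slides name GET/PUT and SEND+RECV abstractly. As a concrete, purely illustrative rendering (not part of the lecture), here is a minimal C/MPI sketch contrasting the two styles: a two-sided exchange where both ranks name the transfer, and a one-sided PUT where only the origin rank does. The payload value and the fence-based window setup are my assumptions.

```c
/* Minimal sketch (illustrative): two-sided vs. one-sided interaction
 * between ranks 0 and 1. Compile with mpicc; run with >= 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Two-sided: both tasks participate (SEND + RECV). */
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* One-sided: rank 0 PUTs into rank 1's window; rank 1 never names
     * the transfer, it only participates in the fence epochs. */
    int target_buf = 0;
    MPI_Win win;
    MPI_Win_create(&target_buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    if (rank == 0)
        MPI_Put(&value, 1, MPI_INT, /*target rank*/ 1,
                /*displacement*/ 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);   /* completes the PUT on both sides */

    if (rank == 1)
        printf("rank 1 got %d (two-sided) and %d (one-sided)\n",
               value, target_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```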
Mapping Techniques for Minimum Idling
• We must simultaneously minimize idling and balance the load
• Balancing the load alone does not minimize idling
[Figure: two execution timelines with the same load balance but different idle time]

Mapping Techniques for Minimum Idling
Static vs. dynamic mappings
• Static mapping
—a priori mapping of tasks to processes
—requirements
– a good estimate of task size
– even so, finding an optimal mapping may be NP-complete, e.g., via the multiple knapsack problem
• Dynamic mapping
—map tasks to processes at runtime
—why?
– tasks are generated at runtime, or
– their sizes are unknown
Factors that influence the choice of mapping:
• size of the data associated with a task
• nature of the underlying domain

Schemes for Static Mapping
• Data partitionings (see the sketch below)
• Task graph partitionings
• Hybrid strategies
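Before the dense-matrix example, a small sketch of the most basic data partitioning (a standard idiom, not stated on the slides; the values of n and p and the printout are illustrative): a balanced 1D block partitioning of n items over p processes that stays correct even when p does not divide n.

```c
/* Sketch: balanced 1D block data partitioning of n items over p
 * processes. Process r owns [lo, hi); block sizes differ by at most 1. */
#include <stdio.h>

int main(void) {
    long n = 10;                       /* number of items (illustrative) */
    int p = 4;                         /* number of processes (illustrative) */
    for (int r = 0; r < p; r++) {
        long lo = (n * r) / p;         /* first index owned by rank r */
        long hi = (n * (r + 1)) / p;   /* one past the last owned index */
        printf("P%d owns items [%ld, %ld): %ld items\n", r, lo, hi, hi - lo);
    }
    return 0;
}
```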
Block Array Distribution Example
Multiplying two dense matrices: C = A x B
• Partition the output matrix C using a block decomposition
• Give each task the same number of elements of C
—each element of C corresponds to one dot product
—this yields an even load balance
• Obvious choices: 1D or 2D decomposition
• Select the decomposition that minimizes the associated communication overhead

Data Usage in Dense Matrix Multiplication
[Figure: the parts of A and B read under (a) a 1D and (b) a 2D block decomposition of C; in (b), C is divided among 16 processes P0 through P15 arranged as a 4 x 4 grid]
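To make the data-usage comparison concrete, here is a small sketch (an illustration, not from the slides; the matrix size, grid side, and variable names are assumptions): for each position (pi, pj) of a q x q process grid it prints the block of C that process owns and the inputs it must read, namely row-block pi of A and column-block pj of B.

```c
/* Sketch: 2D block distribution of an n x n output matrix C over a
 * q x q process grid (p = q*q processes). Values are illustrative. */
#include <stdio.h>

int main(void) {
    int n = 8, q = 4;          /* matrix dimension and grid side (assumed) */
    int b = n / q;             /* block size (assume q divides n) */

    for (int pi = 0; pi < q; pi++) {
        for (int pj = 0; pj < q; pj++) {
            int rank = pi * q + pj;   /* row-major process numbering */
            /* Block of C owned by this process. */
            printf("P%-2d owns C[%d..%d][%d..%d]; ", rank,
                   pi * b, (pi + 1) * b - 1, pj * b, (pj + 1) * b - 1);
            /* Inputs it reads: row-block pi of A and column-block pj of B,
             * i.e., 2*n*n/q elements. A 1D distribution over the same
             * p = q*q processes needs all of B (n*n elements) per process,
             * so the 2D scheme reads far less remote data. */
            printf("reads A rows %d..%d and B cols %d..%d\n",
                   pi * b, (pi + 1) * b - 1, pj * b, (pj + 1) * b - 1);
        }
    }
    return 0;
}
```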
Consider: Gaussian Elimination
• The active submatrix shrinks as elimination progresses
[Figure: after k elimination steps, only the submatrix below and to the right of the pivot remains active; row k holds the elements A[k,j]]
• Under a plain block distribution, processes that own only already-eliminated rows or columns fall idle, which motivates the block-cyclic distribution

Block-Cyclic Distribution
[Figure: (a) 1D block-cyclic distribution; (b) 2D block-cyclic distribution]
• Cyclic distribution: the special case with block size = 1
• Block distribution: the special case with block size n/p
—n is the dimension of the matrix; p is the number of processes

Decomposition by Graph Partitioning
Example: sparse matrix-vector multiply
• The graph of the matrix is useful for decomposition
—work ~ number of edges
—communication for a node ~ the node's degree
• Goal: balance the work and minimize communication
• Partition the graph
—assign an equal number of nodes to each process
—minimize the number of edges cut by the partition

Partitioning a Graph of Lake Superior
[Figure: a random partitioning vs. a partitioning for minimum edge-cut]

Mapping a Sparse Graph
Sparse matrix-vector product
[Figure: for the same sparse matrix structure, a mapping derived directly from the partitioning requires 17 items to be communicated, while an improved mapping requires only 13]

Hierarchical Mappings
• Sometimes a single mapping is inadequate
—e.g., the task mapping of quicksort's binary tree cannot readily use a large number of processors
• Hierarchical approach
—use a task mapping at the top level
—use data partitioning within each task

Centralized Dynamic Mapping
• Processes are designated as masters or slaves
• General strategy
—when a slave runs out of work, it requests more from the master
• Challenge
—the master may become a bottleneck for a large number of processes
• Approach
—chunk scheduling: a process picks up several tasks at once
—however
– large chunk sizes may cause significant load imbalances
– so gradually decrease the chunk size as the computation progresses, as shown in the sketch below
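OpenMP's guided schedule is one widely available realization of this decreasing-chunk-size idea (the slides do not prescribe OpenMP; this is an illustrative mapping of the concept onto a common API, and the work function is a stand-in): the runtime hands each requesting thread a chunk proportional to the remaining iterations, so chunks shrink as the loop progresses.

```c
/* Sketch: decreasing-chunk-size dynamic scheduling via OpenMP's
 * schedule(guided). Compile with: cc -fopenmp guided.c */
#include <omp.h>
#include <stdio.h>

/* Stand-in for a task whose cost varies unpredictably (illustrative). */
static double work(int i) {
    double x = 0.0;
    for (int k = 0; k < (i % 997) * 1000; k++)
        x += k * 1e-9;
    return x;
}

int main(void) {
    const int N = 100000;
    double sum = 0.0;

    /* Threads grab large chunks first, then progressively smaller ones:
     * low scheduling overhead early, good load balance late. */
    #pragma omp parallel for schedule(guided) reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += work(i);

    printf("sum = %f\n", sum);
    return 0;
}
```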
Distributed Dynamic Mapping
• All processes act as peers
• Each process can send work to or receive work from other processes
—this avoids the centralized bottleneck
• Four critical design questions
—how are sending and receiving processes paired?
—who initiates a work transfer?
—how much work is transferred?
—when is a transfer triggered?
• The ideal answers can be application specific
• Cilk uses a distributed dynamic mapping: "work stealing"

Parallel Algorithm Model
• Definition: a way of structuring a parallel algorithm
• Aspects of a model
—decomposition
—mapping technique
—strategy to minimize interactions

Common Parallel Algorithm Models
• Data parallel
—each task performs similar operations on different data
—tasks are typically mapped statically to processes
• Task graph
—use the relationships in the task dependency graph to
– promote locality, or
– reduce interaction costs
• Master-slave
—one or more master processes generate work
—they allocate it to worker processes
—the allocation may be static or dynamic (a sketch follows below)
• Pipeline / producer-consumer
—pass a stream of data through a sequence of processes
—each process performs some operation on the stream
• Hybrid
—apply multiple models hierarchically, or
—apply multiple models in sequence to different phases of the computation
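To close, a compact master-slave sketch with dynamic allocation (illustrative, not from the slides; the tags, task count, and the trivial square-a-number task are my assumptions): the master primes every worker with one task, and each returned result doubles as a request for more work, which is exactly the centralized dynamic mapping strategy described earlier.

```c
/* Sketch: dynamic master-slave work allocation with MPI.
 * Run with at least 2 ranks; rank 0 is the master. */
#include <mpi.h>
#include <stdio.h>

#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int ntasks = 100;               /* illustrative task count */
    if (rank == 0) {                      /* master */
        int next = 0, active = 0, result;
        MPI_Status st;
        /* Prime each worker with one task; stop surplus workers at once. */
        for (int w = 1; w < size; w++) {
            if (next < ntasks) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&next, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        /* Each returned result is an implicit request for more work. */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD, &st);
            if (next < ntasks) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                              /* worker */
        int task, result;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = task * task;         /* stand-in computation */
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```

Because workers request work only when idle, this sketch also illustrates why the master can become a bottleneck as the number of processes grows, which is what motivates the chunked and distributed variants above.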