Parallel Computing Wrapup

Topics for Today

• Dissemination barrier
• Fourier transform
• Review
• Some things we didn't cover
• Parallel programming top ten list

Dissemination Pattern

• Strength
 —avoids reduction tree + broadcast tree
  – reduction + broadcast tree: critical path length = 2 log2 P
 —faster by a factor of two in practice for that reason
• Also useful for idempotent reductions
 —barrier (a coarray barrier sketch follows the FFT examples below)
 —min, max
 —or, and

1D Discrete Fourier Transform

• Discrete Fourier transform on complex numbers (the defining formula is restated after the code examples below)
• Widely employed in signal processing, solving PDEs, multiplying large integers, ...
• Fast Fourier Transform (FFT): O(N log N) implementation

[Figure: radix-8 FFT butterfly network; image not included in the extraction.]
Figure credit: http://upload.wikimedia.org/wikipedia/commons/5/54/FFT_Butterfly_radix8.svg
Figure license: http://commons.wikimedia.org/wiki/Commons:GNU_Free_Documentation_License

Productive Parallel 1D FFT (n = 2^k)

Two stages:
 —bit-reversal permutation
 —butterfly pattern of element combinations

subroutine fft(c, n)
  implicit complex(c)
  dimension c(0:n-1), irev(0:n-1)
!HPF$ processors p(number_of_processors())
!HPF$ template t(0:n-1)
!HPF$ align c(i) with t(i)
!HPF$ align irev(i) with t(i)
!HPF$ distribute t(block) onto p
  two_pi = 2.0d0 * acos(-1.0d0)
  levels = number_of_bits(n) - 1
  irev = (/ (bitreverse(i, levels), i = 0, n-1) /)
  forall (i = 0:n-1) c(i) = c(irev(i))
  do l = 1, levels                  ! --- for each level in the FFT
    m  = ishft(1, l)
    m2 = ishft(1, l - 1)
    do k = 0, n - 1, m              ! --- for each butterfly in a level
      do j = k, k + m2 - 1          ! --- for each point in a half bfly
        ce = exp(cmplx(0.0, (j - k) * -two_pi / real(m)))
        cr = ce * c(j + m2)
        cl = c(j)
        c(j)      = cl + cr
        c(j + m2) = cl - cr
      end do
    end do
  end do
end subroutine fft

SPMD Implementation of FFT in CAF 2.0

• Radix-2 FFT implementation
• Block distribution of coarray "c" across all processors
• Sketch in CAF 2.0:

complex, allocatable :: c(:,2)[*], spare(:)[*]
...
! permute data to bit-reversed indices (uses team_alltoall)
call bitreverse(c, n_world_size, world_size, spare)
! local FFT computation for levels that fit in the memory of an image
do l = 1, loc_comm - 1
  ...
! transpose from block to cyclic data distribution (uses team_alltoall)
call transpose(c, n_world_size, world_size, spare)
! local FFT computation for remaining levels
do l = loc_comm, levels
  ...
! transpose back from cyclic to block data distribution (uses team_alltoall)
call transpose(c, n_world_size, n_local_size/world_size, spare)

SPMD Element-wise Bitreverse in CAF

Each processor performs the reversal for one block of indices, [ rank * n/npe ... (rank + 1) * n/npe - 1 ]. The top log P bits of each reversed index select the destination processor; the remaining log n - log P bits select the position within that processor's block.

local_n = n / npe
global_j = rank * local_n
do local_j = 0, local_n - 1
  jrev = bitreverse(global_j, n_bits)
  processor = ishft(jrev, -local_n_bits)  ! top log P bits
  index = iand(jrev, local_mask)          ! low log n - log P bits
  crev(index)[processor] = c(local_j)
  global_j = global_j + 1
enddo
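The HPF code and the element-wise SPMD loop above both call a two-argument bitreverse function that the slides never define. Below is a minimal sketch of one plausible definition; the name and signature match the calls above, but the actual course library may differ.

pure integer function bitreverse(j, nbits)
  ! Reverse the low 'nbits' bits of j using standard Fortran bit
  ! intrinsics, e.g. bitreverse(3, 4) = 12 (binary 0011 -> 1100).
  integer, intent(in) :: j, nbits
  integer :: b
  bitreverse = 0
  do b = 0, nbits - 1
    if (btest(j, b)) bitreverse = ibset(bitreverse, nbits - 1 - b)
  end do
end function bitreverse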
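For reference, the transform all of these kernels compute is the standard 1D DFT of an N-point complex sequence x_0, ..., x_{N-1}:

X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i \, n k / N}, \qquad k = 0, \ldots, N-1

Evaluated directly, this costs O(N^2) operations; the radix-2 butterfly structure in the code above reuses shared partial sums to reach O(N log N).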
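Returning to the dissemination pattern slide: below is a minimal dissemination-barrier sketch in standard Fortran 2008 coarray syntax (not the CAF 2.0 dialect used in class). In round k, image i synchronizes with images i +/- 2^k (mod P); after ceil(log2 P) rounds every image has transitively heard from every other, so all have reached the barrier. Note that sync images is pairwise (two-way), whereas a classic dissemination barrier uses one-way notify/wait flags; the two-way variant is slightly stronger but keeps the O(log P) critical path.

program dissemination_barrier
  implicit none
  integer :: me, np, dist, up, down

  me = this_image()
  np = num_images()

  dist = 1
  do while (dist < np)
    up   = 1 + mod(me - 1 + dist, np)       ! partner I notify
    down = 1 + mod(me - 1 - dist + np, np)  ! partner I await
    if (up == down) then
      sync images (up)            ! np even and dist = np/2: single partner
    else
      sync images ([up, down])
    end if
    dist = 2 * dist
  end do
  ! every image reaching this point knows all others have arrived
end program dissemination_barrier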
All-to-All Personalized Communication

[Figure: diagram of an all-to-all personalized exchange among processors; the image content did not survive extraction.]
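The operation the lost figure depicts is the one team_alltoall performs in the CAF 2.0 sketches above: each of P processors sends a distinct block of data to every other processor. For illustration only, here is a sketch of the same pattern using MPI from Fortran; the buffer contents and the per-destination count nper are made up for the example.

program alltoall_demo
  use mpi
  implicit none
  integer, parameter :: nper = 4   ! elements sent to each rank (illustrative)
  integer :: rank, nprocs, ierr, i
  integer, allocatable :: sendbuf(:), recvbuf(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  allocate (sendbuf(nper*nprocs), recvbuf(nper*nprocs))
  ! block d of sendbuf holds what this rank sends to rank d; after the
  ! call, block s of recvbuf holds what rank s sent to this rank
  sendbuf = [(rank*1000 + i, i = 1, nper*nprocs)]

  call MPI_Alltoall(sendbuf, nper, MPI_INTEGER, &
                    recvbuf, nper, MPI_INTEGER, MPI_COMM_WORLD, ierr)

  call MPI_Finalize(ierr)
end program alltoall_demo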
Course Objectives

• Learn fundamentals of parallel computing
 —principles of parallel algorithm design
 —programming models and methods
 —parallel computer architectures
 —modeling and analysis of parallel programs and systems
 —parallel algorithms
• Develop skill in writing parallel programs
 —programming assignments employing a variety of models
• Develop skill in analyzing parallel computing problems
 —develop parallelizations for different styles of computations

Principles of Parallel Algorithm Design

• Algorithm models
 —data-parallel, task graph, work pool
 —master-slave, pipeline, hybrid
• Decomposition techniques
 —recursive decomposition
 —data decomposition: input data, output data, intermediate data
 —hybrid decomposition
 —exploratory decomposition
 —speculative decomposition
• Task generation
 —static vs. dynamic

[Figure: example task graph and candidate task-to-process assignments; details not recoverable from the extraction.]

Parallel Architectures

• Control structure and communication models
 —control structure: SIMD, MIMD
 —communication models
  – shared address space
  – message passing platforms
• Network topologies
 —static/direct vs. dynamic/indirect networks
 —bus, crossbar, omega, hypercube, fat tree, mesh, Kautz graph
 —hybrid interconnects
 —evaluation metrics
  – degree, diameter, bisection width, channel width & rate, cost
• Coherence, routing, and network embeddings
 —blocking vs. non-blocking networks
 —routing techniques: store & forward, packet, wormhole
 —cache coherence: protocols, snoopy caches, directories, SCI
 —embeddings: dilation, congestion

Analytical Modeling of Parallel Systems

• Overheads
 —idling: serialization, load imbalance
 —data movement, resource sharing, extra computation
• Metrics
 —time, total overhead, speedup, efficiency
• Scalability
 —cost optimality
 —isoefficiency
  – how the problem size must grow as a function of the number of PEs to maintain efficiency
 —Amdahl's law and limits on speedup (restated after this review)
• Asymptotic analysis
 —algorithm complexity
 —scalability

Synchronization

• Insufficient synchronization causes data races
 —unordered, conflicting operations
• Mutual exclusion: classical algorithms for locks
 —explore formal reasoning about concurrent operations
• Lock synchronization with atomic primitives
 —practical algorithms for pairwise coordination
• Barrier synchronization
 —separate phases to prevent overlap of conflicting operations
 —strategies for fast, primitive collective synchronization

Reasoning about Concurrent Operations

• Serializability
 —as if operations were executed in some sequential order
• Sequential consistency
 —as if all operations were executed in some sequential order
 —operations of each processor appear in program order
• Linearizability
 —each operation appears to take effect instantaneously
 —interleaving consistent with operation execution in time

Active Areas of Research

• Architectures
 —the quest for exascale performance: 10^18 operations per second
 —multi-core processors
  – number of cores and threads is rapidly increasing
  – hardware support for simplifying parallelism
 —heterogeneity is already here; more is coming
• Compilers for parallel systems
• Operating systems for scalable architectures
 —lightweight microkernels: ZeptoOS, Compute Node Linux
• Multiscale algorithms: regular and irregular variants
• Transactional memory
 —first transactional hardware delivered on Blue Gene/Q
 —coming in 2013 on Intel Haswell
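Before the tips, the one formula from the modeling review worth restating (flagged above) is Amdahl's law. If a fraction f of the work is inherently serial, the speedup on P processors is bounded by

S(P) = \frac{1}{f + (1-f)/P} \;\le\; \frac{1}{f}

For example, f = 0.05 caps speedup at 20 no matter how many processors are used, which is why avoiding serialization appears among the tips below.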
Top Ten Tips for Parallel Computing

It's all about the performance:

• Use an efficient algorithm
 —a clever implementation will still yield to asymptotic inefficiency at scale
• Partition your data and computation carefully
 —the wrong data partitioning can yield high communication volume
 —the wrong computation partitioning can lead to load imbalance
  – work stealing can help
• Choose your programming model judiciously
 —shared-memory models make irregular problems easier
• Avoid serialization
 —efficiency requires all processors to be computing
 —may require changes to algorithm and partitioning of data & computation
• Choose the proper grain size for computation
 —the wrong grain size can lead to excessive communication frequency