Wrapup - Parallel Computing - Lecture Slides

Parallel computing is an emerging subject in the field of computer science. This course is designed to introduce the architecture and basic concepts of parallel computing. This lecture includes: Wrapup, Parallel Computing Wrapup, Dissemination Barrier, Fourier Transform, Dissemination Barrier Algorithm, Dissemination Barrier in Action, Dissemination Pattern, 1D Discrete Fourier Transform, Butterfly Pattern of Computation, SPMD Implementation of FFT.

Typology: Slides

Academic year: 2012/2013

Uploaded on 09/28/2013 by dhanvant

Partial preview of the text

Parallel Computing Wrapup

Topics for Today (slide 2)
• Dissemination barrier
• Fourier transform
• Review
• Some things we didn't cover
• Parallel programming top ten list

Dissemination Pattern (slide 5)
• Strength: avoids a reduction tree plus a broadcast tree
  – reduction + broadcast tree: critical path length = 2 log2 P
  – the dissemination pattern needs only log2 P rounds, so it is faster by a factor of two in practice
• Also useful for idempotent reductions
  – barrier
  – min, max
  – or, and
(A minimal threaded sketch of a dissemination barrier follows this section.)

1D Discrete Fourier Transform (slide 6)
• Discrete Fourier transform on complex numbers
• Widely employed in signal processing, solving PDEs, multiplying large integers, ...
• Fast Fourier Transform (FFT): an O(N log N) implementation
(A plain sequential transcription of the radix-2 algorithm follows this section.)
[Figure: radix-8 FFT butterfly diagram]
Figure credit: http://upload.wikimedia.org/wikipedia/commons/5/54/FFT_Butterfly_radix8.svg
Figure license: http://commons.wikimedia.org/wiki/Commons:GNU_Free_Documentation_License

Productive Parallel 1D FFT, n = 2^k (slide 7)
Two stages: a bit-reversal permutation, then a butterfly pattern of element combinations.

```fortran
      subroutine fft(c, n)
      implicit complex(c)
      dimension c(0:n-1), irev(0:n-1)
!HPF$ processors p(number_of_processors())
!HPF$ template t(0:n-1)
!HPF$ align c(i) with t(i)
!HPF$ align irev(i) with t(i)
!HPF$ distribute t(block) onto p
      two_pi = 2.0d0 * acos(-1.0d0)
      levels = number_of_bits(n) - 1
      irev = (/ (bitreverse(i, levels), i = 0, n-1) /)
      forall (i = 0:n-1) c(i) = c(irev(i))
      do l = 1, levels                    ! --- for each level in the FFT
        m  = ishft(1, l)
        m2 = ishft(1, l - 1)
        do k = 0, n - 1, m                ! --- for each butterfly in a level
          do j = k, k + m2 - 1            ! --- for each point in a half bfly
            ce = exp(cmplx(0.0, (j - k) * -two_pi / real(m)))
            cr = ce * c(j + m2)
            cl = c(j)
            c(j)      = cl + cr
            c(j + m2) = cl - cr
          end do
        end do
      end do
      end subroutine fft
```

SPMD Implementation of FFT in CAF 2.0 (slide 10)
• Radix-2 FFT implementation
• Block distribution of coarray "c" across all processors
• Sketch in CAF 2.0:

```fortran
complex, allocatable :: c(:,2)[*], spare(:)[*]
...
! permute data to bit-reversed indices (uses team_alltoall)
call bitreverse(c, n_world_size, world_size, spare)
! local FFT computation for levels that fit in the memory of an image
do l = 1, loc_comm-1 ...
! transpose from block to cyclic data distribution (uses team_alltoall)
call transpose(c, n_world_size, world_size, spare)
! local FFT computation for remaining levels
do l = loc_comm, levels ...
! transpose back from cyclic to block data distribution (uses team_alltoall)
call transpose(c, n_world_size, n_local_size/world_size, spare)
```

SPMD Element-wise Bitreverse in CAF (slide 11)
Each processor performs reversal of a block [rank * n/npe ... (rank + 1) * n/npe - 1]:

```fortran
local_n  = n / npe
global_j = rank * local_n
do local_j = 0, local_n - 1
  jrev = bitreverse(global_j, n_bits)
  processor = ishft(jrev, -local_n_bits)  ! high log P bits pick the image
  index = iand(jrev, local_mask)          ! low log n - log P bits pick the slot
  crev(index)[processor] = c(local_j)
  global_j = global_j + 1
enddo
```

All-to-All Personalized Communication (slide 12)
[Figure: all-to-all personalized communication among processors; the graphic did not survive text extraction]
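The dissemination barrier's algorithm slides fall outside this partial preview, so as a concrete reference here is a minimal one-shot sketch using Python threads. It is an illustration under stated assumptions, not the lecture's implementation: the thread count P, the flag layout, and all names are this sketch's own. In round r, thread i signals the thread 2^r positions ahead (mod P) and waits on a flag set by the thread 2^r positions behind; after ceil(log2 P) rounds every thread has, transitively, heard from every other.

```python
import threading
from math import ceil, log2

P = 8                                    # number of threads (illustrative choice)
ROUNDS = ceil(log2(P))                   # dissemination needs ceil(log2 P) rounds

# flags[r][i] is set once thread i's round-r partner has signalled it.
flags = [[threading.Event() for _ in range(P)] for _ in range(ROUNDS)]

def dissemination_barrier(i):
    for r in range(ROUNDS):
        flags[r][(i + (1 << r)) % P].set()   # signal the thread 2^r ahead
        flags[r][i].wait()                   # wait on the thread 2^r behind

def worker(i):
    print(f"thread {i} reached the barrier")
    dissemination_barrier(i)
    print(f"thread {i} passed the barrier")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each Event is used exactly once, this barrier is single-use; a reusable version would rotate through fresh flags or use sense reversal.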
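For readers without an HPF or CAF compiler, the following plain sequential Python transcription of slide 7's two stages may help; it mirrors the loop structure and twiddle factor of the HPF code above, but the function names are this sketch's own and all parallelism is stripped out.

```python
import cmath

def bitreverse(x, bits):
    """Reverse the low `bits` bits of integer x."""
    y = 0
    for _ in range(bits):
        y = (y << 1) | (x & 1)
        x >>= 1
    return y

def fft(c):
    """Iterative radix-2 FFT; len(c) must be a power of two."""
    n = len(c)
    levels = n.bit_length() - 1            # log2(n)
    # stage 1: bit-reversal permutation
    c = [c[bitreverse(i, levels)] for i in range(n)]
    # stage 2: one pass of butterflies per level
    for l in range(1, levels + 1):
        m, m2 = 1 << l, 1 << (l - 1)
        for k in range(0, n, m):           # each butterfly group in this level
            for j in range(k, k + m2):     # each point in a half-butterfly
                ce = cmath.exp(-2j * cmath.pi * (j - k) / m)   # twiddle factor
                cr = ce * c[j + m2]
                cl = c[j]
                c[j] = cl + cr
                c[j + m2] = cl - cr
    return c

print(fft([1, 0, 0, 0]))                   # DFT of a unit impulse: all ones
```

A quick check: fft([1, 0, 0, 0]) returns [1, 1, 1, 1] (as complex numbers), the DFT of a unit impulse; results can also be compared against numpy.fft.fft.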
Topics for Today (slide 15)
• Dissemination barrier
• Fourier transform
• Review
• Some things we didn't cover
• Parallel programming top ten list

Course Objectives (slide 16)
• Learn fundamentals of parallel computing
  – principles of parallel algorithm design
  – programming models and methods
  – parallel computer architectures
  – modeling and analysis of parallel programs and systems
  – parallel algorithms
• Develop skill writing parallel programs
  – programming assignments employing a variety of models
• Develop skill analyzing parallel computing problems
  – develop parallelizations for different styles of computations

Principles of Parallel Algorithm Design (slide 17)
• Algorithm models
  – data-parallel, task graph, work pool
  – master-slave, pipeline, hybrid
• Decomposition techniques
  – recursive
  – data driven: input data, output data, intermediate data
  – hybrid decomposition
  – exploratory decomposition
  – speculative decomposition
• Task generation
  – static vs. dynamic assignment
[Figure: task graph with example assignments of tasks to processors; the graphic did not survive text extraction]

Parallel Architectures (slide 20)
• Control structure and communication models
  – control structure: SIMD, MIMD
  – communication models: shared address space, message passing platforms
• Network topologies
  – static/direct vs. dynamic/indirect networks
  – bus, crossbar, omega, hypercube, fat tree, mesh, Kautz graph
  – hybrid interconnects
  – evaluation metrics: degree, diameter, bisection width, channel width & rate, cost (closed forms for the hypercube appear in a note after this review)
• Coherence, routing, and network embeddings
  – blocking vs. non-blocking networks
  – routing techniques: store & forward, packet, wormhole
  – cache coherence: protocols, snoopy caches, directories, SCI
  – embeddings: dilation, congestion

Analytical Modeling of Parallel Systems (slide 21)
• Overheads
  – idling: serialization, load imbalance
  – data movement, resource sharing, extra computation
• Metrics
  – time, total overhead, speedup, efficiency
• Scalability
  – cost optimality
  – isoefficiency: how must the problem grow as a function of PEs to maintain efficiency
  – Amdahl's law and limits on speedup (restated in a note after this review)
• Asymptotic analysis
  – algorithm complexity
  – scalability

Synchronization (slide 22)
• Insufficient synchronization causes data races
  – unordered, conflicting operations (a small demonstration follows this review)
• Mutual exclusion: classical algorithms for locks
  – explore formal reasoning about concurrent operations
• Lock synchronization with atomic primitives
  – practical algorithms for pairwise coordination
• Barrier synchronization
  – separate phases to prevent overlap of conflicting operations
  – strategies for fast, primitive collective synchronization

Reasoning about Concurrent Operations (slide 25)
• Serializability
  – as if operations were executed in some sequential order
• Sequential consistency
  – as if all operations were executed in some sequential order
  – operations of each processor appear in program order
• Linearizability
  – each operation appears to take effect instantaneously
  – interleaving consistent with operation execution in time
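Two of the review bullets above promise short notes. First, the network evaluation metrics take simple closed forms on the hypercube, one of the topologies listed on slide 20 (standard facts, stated here as a refresher):

```latex
% d-dimensional hypercube with P = 2^d nodes
\text{degree} = \log_2 P, \qquad
\text{diameter} = \log_2 P, \qquad
\text{bisection width} = \frac{P}{2}
```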
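Second, Amdahl's law from slide 21, with its limiting form; the numeric example is ours:

```latex
% speedup on P processors when a fraction f of the work is inherently serial
S(P) = \frac{1}{f + \frac{1-f}{P}},
\qquad
\lim_{P \to \infty} S(P) = \frac{1}{f}
% e.g. f = 0.05 caps the achievable speedup at 20x, however many PEs are used
```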
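Finally, slide 22's claim that insufficient synchronization causes data races can be made concrete with a short sketch; the names and counts are illustrative, and whether the lost updates actually show up depends on the interpreter's thread scheduling:

```python
import threading

counter = 0                              # shared state with no synchronization

def increment(n):
    global counter
    for _ in range(n):
        counter += 1                     # load, add, store: steps can interleave

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The conflicting read-modify-write sequences are unordered, so updates
# can be lost: the printed total may fall below the expected 400000.
print(counter)
```

Wrapping the increment in a threading.Lock orders the conflicting operations and restores the expected total.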
Active Areas of Research (slide 26)
• Architectures
  – the quest for exascale performance: 10^18 operations per second
  – multi-core processors: the number of cores and threads is rapidly increasing; hardware support for simplifying parallelism
  – heterogeneity is already here; more is coming
• Compilers for parallel systems
• Operating systems for scalable architectures
  – lightweight microkernels: ZeptoOS, Compute Node Linux
• Multiscale algorithms: regular and irregular variants
• Transactional memory
  – first transactional hardware delivered on Blue Gene/Q
  – coming in 2013 on Intel Haswell

Top Ten Tips for Parallel Computing (slide 27)
It's all about the performance.
• Use an efficient algorithm
  – a clever implementation will still yield to asymptotic inefficiency at scale
• Partition your data and computation carefully
  – the wrong data partitioning can yield high communication volume
  – the wrong computation partitioning can lead to load imbalance; work stealing can help
• Choose your programming model judiciously
  – shared-memory models make irregular problems easier
• Avoid serialization
  – efficiency requires all processors to be computing
  – may require changes to the algorithm and the partitioning of data & computation
• Choose the proper grain size for computation
  – the wrong grain size can lead to excessive communication frequency